MergeMitra vs CodeRabbit vs Greptile: 2026 AI Code Review Benchmark
A controlled, side-by-side evaluation on two large open-source codebases.
Executive Summary
Objective: Evaluate how effectively three AI code review tools -- MergeMitra, CodeRabbit, and Greptile -- review production-grade code, and determine which one best fits enterprise teams.
Context & Approach: Greptile published an open benchmark in 2025 that tested AI review tools against real bugs from large open-source codebases. We adopted the same methodology, using the same PRs and the same real production bugs Greptile used.
We used two very different codebases: Cal.com (TypeScript) and Keycloak (Java). For each codebase, we created three mirror repositories on GitHub and installed exactly one code review tool in each. We then opened 10 PRs per mirror, each carrying a real historical bug, for a total of 60 PRs.
Every tool saw byte-identical diffs and ran on default settings with no custom rules, so any difference in results reflects the tool, not the input. All 60 reviews were independently examined using a standard prompt run on Claude Code (with Opus 4.6), and every verdict is linked to verifiable GitHub evidence.
Key Takeaways:
- MergeMitra caught 85% of planted bugs (17/20), compared to CodeRabbit at 65% (13/20) and Greptile at 60% (12/20). Four bugs -- including a Critical SQL injection and a High-severity email blacklist bypass -- were caught only by MergeMitra.
- MergeMitra led on every quality dimension that matters for bug prevention: security, performance, test quality, maintainability, architectural insight, and cross-file reasoning -- all at 80-90% effectiveness versus 30-60% for the other tools.
- Greptile excels at signal-to-noise (zero false positives) but trades breadth for precision, staying silent on many real issues. CodeRabbit offers broad coverage with polished autofix UX but introduces noise that requires triage, including a hallucinated CVE identifier.
Recommendation: For enterprise adoption, MergeMitra is the clear winner according to the Claude Code evaluation. Full details follow below.
Background
In July 2025, Greptile published an open benchmark for AI code review tools: greptile.com/benchmarks. Their methodology was straightforward and well-designed. They selected five large, real-world open-source repositories - Sentry (Python), Cal.com (TypeScript), Grafana (Go), Keycloak (Java), and Discourse (Ruby) - traced back 10 real bug-fix commits per repo from Git history, and reintroduced the original buggy code as fresh pull requests. They then ran five AI review tools - Greptile, CodeRabbit, Bugbot, Copilot, and Graphite - on those PRs and scored each tool on whether it caught the planted bug.
It was a credible benchmark. Real bugs from real production codebases, identical diffs across all tools, a clear pass/fail scoring criterion. We wanted to use the same methodology to answer a different question: how does MergeMitra compare to Greptile and CodeRabbit?
So we did exactly that. We took the same two codebases from Greptile's benchmark - Cal.com and Keycloak - used the same PRs carrying the same real production bugs, set up our own mirror repositories, and ran the test with three tools: MergeMitra, CodeRabbit, and Greptile.
This report walks through the full process and results.
1. Codebases Under Test
We chose Cal.com and Keycloak from Greptile's original five-repo benchmark because they cover two very different technology stacks and domain pressures:
Cal.com gives us a TypeScript-heavy modern web stack. Keycloak gives us a long-standing Java enterprise system with deep concerns around identity, authorization and cryptography. If a tool performs well on both, the result is meaningful. Together they give us coverage across two languages, two ecosystems, and two very different kinds of domain complexity.
2. How We Set Up the Test
Step 1 - Same bugs, same PRs as Greptile's benchmark
Greptile's benchmark traced back real bug-fix commits from each project's Git history. For each bug, they identified the commit that originally introduced the flawed code and the commit that later fixed it. They then created two branches - one before the bug was introduced and one after - and opened a fresh PR that reintroduced the original buggy change. This meant every PR in the benchmark carried a real, historical production bug: something that was introduced during normal development, ran in production, was eventually reported, diagnosed, and patched by the real maintainers.
We used the same PRs and the same bugs. We did not pick new bugs or create synthetic ones. The bugs span the full range you see in real codebases:
- Authentication and authorization flaws
- Race conditions and concurrency issues
- Performance regressions (N+1 queries, runaway memory)
- Privacy and data leaks
- Lifecycle and migration bugs
- Test-quality regressions
- Maintainability and architecture issues
- Localization and content errors
Step 2 - Three mirror repositories per codebase
For each of the two codebases, we created three clean mirror repositories on GitHub - one per AI review tool - and installed exactly one bot in each.
That gives us 6 repositories in total, with one tool per repository. Same code in all three mirrors of each project. Different reviewer in each.
Step 3 - Open the same 10 PRs in each mirror
We opened the same 10 pull requests in each of the three mirror repositories for each codebase - 60 PRs in total (10 PRs × 3 tools × 2 codebases). Each PR carries one real production bug from that project's history.
Critically, the same PR has a byte-identical diff and the same commit hash in all three mirror repositories. We verified this with the GitHub API (gh api repos/.../pulls/N --jq '.head.sha'). So if one tool found the bug and another did not, the difference comes from the tool, not the diff.
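One way to script this check is sketched below. It is a minimal illustration, assuming the GitHub CLI (`gh`) is installed and authenticated; the helper names and any mirror repository names passed in are hypothetical placeholders, not the ones used in this study.

```python
import subprocess


def head_sha(repo: str, pr_number: int) -> str:
    """Return the head commit SHA of a PR by shelling out to the GitHub CLI."""
    result = subprocess.run(
        ["gh", "api", f"repos/{repo}/pulls/{pr_number}", "--jq", ".head.sha"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def collect_shas(mirrors, pr_numbers, fetch=head_sha):
    """Build {pr_number: {mirror: sha}}; `fetch` is injectable for offline testing."""
    return {pr: {m: fetch(m, pr) for m in mirrors} for pr in pr_numbers}


def mismatched_prs(shas_by_pr: dict[int, dict[str, str]]) -> list[int]:
    """Return the PR numbers whose head SHA differs between mirrors."""
    return sorted(
        pr for pr, shas in shas_by_pr.items() if len(set(shas.values())) != 1
    )
```

An empty list from `mismatched_prs` means every mirror of a given PR points at the same commit, which is the precondition for attributing differences to the tool rather than the diff.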
Step 4 - Let the tools run, then collect everything
All three tools ran on their default settings with no custom rules - the same constraint Greptile used in their benchmark. Each tool had full repository access including the PR diff and base branch. Once the reviews were in, we collected every review comment, every inline comment, and every issue comment from all 60 PRs via the GitHub CLI.
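The collection step can be sketched as follows. This is an illustration under stated assumptions rather than the script used for this report: it assumes `gh` is installed and authenticated, the function names are hypothetical, and it requests a single page of up to 100 items per endpoint, so a PR with more comments than that would need pagination.

```python
import json
import subprocess

# The three GitHub REST endpoints that together hold a PR's review output.
ENDPOINTS = {
    "review_summaries": "repos/{repo}/pulls/{n}/reviews",
    "inline_comments": "repos/{repo}/pulls/{n}/comments",
    "issue_comments": "repos/{repo}/issues/{n}/comments",
}


def endpoint_paths(repo: str, pr_number: int) -> dict[str, str]:
    """Fill in the repository and PR number for each endpoint template."""
    return {
        kind: template.format(repo=repo, n=pr_number)
        for kind, template in ENDPOINTS.items()
    }


def fetch_review_output(repo: str, pr_number: int) -> dict[str, list]:
    """Fetch all three comment types for one PR via the GitHub CLI."""
    collected = {}
    for kind, path in endpoint_paths(repo, pr_number).items():
        result = subprocess.run(
            ["gh", "api", f"{path}?per_page=100"],
            capture_output=True, text=True, check=True,
        )
        collected[kind] = json.loads(result.stdout)
    return collected
```

Keeping the three comment types separate matters later, because a finding buried in a review summary is easier to overlook than one posted inline.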
3. Validation Methodology
Greptile's original benchmark was scored by Greptile's own team. We wanted an independent evaluator with no relationship to any of the three tools. So we used Claude Code (powered by Claude Opus 4.6, Anthropic's most capable model with one-million-token context) as the validation layer.
For each of the 60 PR reviews, Claude Code:
- Pulled every review comment from the PR via the GitHub API - inline review comments, review summaries, and issue comments.
- Read the actual source code at the exact commit the PR introduced, including surrounding context (20–60 lines above and below the cited line).
- Checked whether the planted bug was caught - a clean yes or no. A bug counted as "caught" only when the tool explicitly identified the faulty code and explained its impact, consistent with Greptile's original scoring criterion. Summary-level mentions or vague warnings without identifying the specific code did not count.
- Produced a verification table with a direct link to the review comment (for catches) or to the PR (for misses), so every result is independently verifiable by clicking the link.
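The scoring rule in the steps above can be expressed as a small data structure. The sketch below is hypothetical: the `Verdict` fields and the `catch_rate` helper are our own illustration of the "identified the faulty code and explained its impact" criterion, not the format of the underlying reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    tool: str
    codebase: str
    bug_id: int
    identified_faulty_code: bool  # the review points at the specific faulty code
    explained_impact: bool        # the review explains what goes wrong
    evidence_url: str             # review comment (catch) or PR (miss)

    @property
    def caught(self) -> bool:
        # Both conditions are required; a vague or summary-only warning is a miss.
        return self.identified_faulty_code and self.explained_impact


def catch_rate(verdicts: list[Verdict], tool: str) -> float:
    """Fraction of planted bugs a tool caught, between 0.0 and 1.0."""
    own = [v for v in verdicts if v.tool == tool]
    if not own:
        raise ValueError(f"no verdicts recorded for {tool!r}")
    return sum(v.caught for v in own) / len(own)
```

Storing the evidence link on every record, including misses, is what makes each verdict checkable by a reader.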
Claude Code wrote one report per codebase. We then consolidated them into this single report with the combined verification tables.
In total, 60 PR reviews were validated against source - every tool's verdict on every PR cross-checked against the actual planted bug. Nothing was taken on the bot's word. Nothing was taken on the validator's word either - every catch and miss mark in the tables below links directly to the GitHub evidence.
4. A Note on Code Review and Bug Detection
Before we share the results, one piece of context: code review is not primarily a bug-catching activity. Every senior engineer knows this. The main jobs of a code review are to:
- Reduce technical debt before it accumulates.
- Enforce architectural and security best practices.
- Improve long-term maintainability of the codebase.
- Spread knowledge across the team.
Some bugs do get caught in code review - usually shallow ones, near the surface - but most functional bugs are caught by automated tests, QA, and production monitoring. So when we ask "can an AI reviewer catch the bug?" we are deliberately stress-testing these tools on a dimension where even experienced human reviewers often struggle.
We chose to measure bug-finding precisely because it is hard. A tool that can do this - on top of the maintainability, architecture and security hygiene that code review is normally for - is genuinely valuable. A tool that can only do the easy stuff is not.
The results below reflect that stiff test.
5. Results - Cal.com (TypeScript)
The 10 Cal.com PRs cover scheduling, OAuth integrations, calendar sync, workflows, two-factor authentication, and a major UI feature. In the verification table, a catch mark means the tool successfully flagged the planted bug and a miss mark means it missed it. Each catch links directly to the review comment where the tool flagged the bug. Each miss links to the PR so readers can verify nothing was found.
Bug #7 (timing attack) was missed by all three tools. MergeMitra caught every other bug - including the SQL injection risk (Bug #4) and the case-sensitivity blacklist bypass (Bug #9) that both other tools missed.
6. Results - Keycloak (Java)
The 10 Keycloak PRs cover Keycloak's authentication, authorization, cryptography, identity-provider caching, and update-management subsystems - all enterprise-critical surfaces. Each catch links directly to the review comment where the tool flagged the bug. Each miss links to the PR.
Bug #2 (IdP cache recursive caching) was missed by all three tools. MergeMitra flagged the exit-code contract issue (Bug #4) that both other tools missed, and uniquely went deeper on Bug #10 - pointing out that three sibling getSubGroupsStream() methods still lacked the null check, not just confirming the bug exists.
7. Results - Effectiveness by Category
How to read this table
If we say "Security: CodeRabbit 60%", it means: out of all the security-relevant issues that actually exist on these 20 PRs (the theoretical maximum a perfect reviewer could find), CodeRabbit identified about 60% of them. If the total number of issues is 20, CodeRabbit found 12. If MergeMitra is at 90%, it found 18.
The total number of issues in each category - the maximum possible - is a fixed denominator estimated from the combined list of every real issue any of the three tools (or our human validation) found across the 20 PRs. Higher percentages are better, except for False-positive rate where lower is better.
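The arithmetic is simple enough to sketch. The function names below are hypothetical; the counts in the loop are the headline planted-bug figures reported in this document, and the denominator helper illustrates the "combined list of every real issue" idea with made-up issue IDs.

```python
def effectiveness(found: int, total: int) -> float:
    """Percentage of the fixed denominator that a tool identified."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100 * found / total


def category_denominator(found_by_source: dict[str, set[str]]) -> int:
    """Size of the union of issues found by any tool or by validation."""
    return len(set().union(*found_by_source.values()))


# Headline planted-bug counts from this report (20 PRs).
planted_bugs_caught = {"MergeMitra": 17, "CodeRabbit": 13, "Greptile": 12}
for tool, caught in planted_bugs_caught.items():
    print(f"{tool}: {effectiveness(caught, 20):.0f}%")
```

Because the denominator is a union, an issue found by only one source still counts against every tool that missed it.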
Combined scorecard (across both codebases, 20 PRs)
What the scorecard says
- MergeMitra leads on every dimension that matters for bug prevention. At 85% functional bug detection (17 of 20 planted bugs), it is dominant - and that gap widens on security, performance, test quality, maintainability, architecture and domain awareness.
- CodeRabbit and Greptile are closer to each other than either is to MergeMitra. CodeRabbit caught 13 of 20 bugs (65%); Greptile caught 12 of 20 (60%). The difference between them is what they miss - CodeRabbit tends to miss subtle bugs requiring domain knowledge (SQL injection, blacklist bypass), while Greptile tends to miss bugs requiring cross-file reasoning (2FA backup-code reuse, translation mapping errors).
- Greptile has one genuine super-power: signal-to-noise. Almost everything it says is correct. Almost nothing it says is fluff. But the price of that conservatism is breadth - it stays silent on a lot of real issues.
- CodeRabbit is broad but volatile. It picks up issues across many dimensions, but it also produces noise (a "LGTM!" praise block posted as a finding; a likely-fabricated CVE identifier; one PR with zero inline findings at all). Reviewers have to triage.
8. Tool-by-Tool Observations
Each tool has a recognizable personality across the 60 reviews we collected.
MergeMitra - the senior reviewer
The only tool whose reviews consistently feel like they were written by an experienced engineer who already knows the codebase. It doesn't just point at the line that changed - it reasons about what the change implies for the rest of the system, asks whether the fix actually fixes the problem, and routinely flags issues that span three or four files.
The decisive behavior: MergeMitra is the only tool that goes beyond confirming a bug exists to asking whether the fix is complete. On the concurrent-group NPE fix (Keycloak Bug #10), all three tools flagged the null-check problem - but only MergeMitra pointed out that three sibling getSubGroupsStream() methods still lacked the same null check, meaning the fix was incomplete. That single question - "did this PR actually fix the problem?" - is the difference between a junior and a senior reviewer.
The numbers back this up. MergeMitra caught 17 of 20 planted bugs (85%) - versus CodeRabbit's 13 (65%) and Greptile's 12 (60%). Among those, four bugs were caught only by MergeMitra and by no other tool (Cal.com Bugs #1, #4, #9; Keycloak Bug #4).
The trade-off is volume. MergeMitra is happy to leave 8–13 comments on a busy PR. The signal density is high (almost no false positives in 171 findings), but reviewers have to be willing to read.
Greptile - the precise sniper
When Greptile speaks, it is almost always correct. Zero false positives across 63 findings. Cleanly labeled severities. Concise comments. No fluff.
The trade-off is narrowness. Greptile reviews the file in front of it; it rarely follows the code into other files, and it does not engage with test quality, architecture or long-term maintainability. On two of the highest-stakes PRs in our study - the 2FA backup-code feature on Cal.com and the rolling-updates feature on Keycloak - Greptile posted comments and still missed the planted bug. It also missed the SQL injection risk on the Insights raw-query refactor (Cal.com Bug #4, Critical) and the email blacklist bypass on the Add-Guests feature (Cal.com Bug #9, High) - both of which only MergeMitra caught.
If you want a quiet second opinion - "tell me only the things I really need to act on" - Greptile is a defensible pick. If you want a tool that catches the deep stuff, it isn't.
CodeRabbit - the broad generalist with a noise problem
CodeRabbit covers a lot of ground. It is genuinely strong on content-layer review (it caught Italian text mistakenly bundled into the Lithuanian translation file, and Traditional Chinese characters bundled into the Simplified Chinese file - exactly the tedious work humans do badly). It produces ready-to-apply diff suggestions for almost every finding, which is a real UX win.
But it has reliability problems. In our study, CodeRabbit:
- Posted "LGTM! Comprehensive documentation…" as an inline finding (this is praise, not a finding).
- Cited a specific CVE identifier (CVE-2025-66021) with implausible precision - a classic AI hallucination pattern, and a particularly dangerous one because security teams trust CVE-shaped claims by default.
- Produced zero inline comments on one PR (Keycloak rolling-updates) where the other two tools found 3 and 5 substantive issues.
- Buried Critical findings inside collapsed summary blocks instead of posting them as inline comments - meaning a reviewer scanning inline comments would miss them.
If your team is willing to treat CodeRabbit's output as a starting point and triage carefully, the autofix UX is genuinely helpful. If your team treats AI suggestions at face value, the noise becomes a liability.
9. Four Findings That Defined the Study
The verification tables tell you what each tool caught. These four findings show you why the gaps matter. Each one was confirmed against the actual source code, and in every case, the verification data shows a clear difference between the tools.
Finding 1 - SQL Injection in a Performance Refactor (Cal.com Bug #4, Critical)
Cal.com's InsightsBookingService was refactored from Prisma's type-safe query builder to raw Prisma.sql queries for performance. The new getBaseConditions() function constructed SQL fragments that callers composed with string interpolation - introducing a SQL injection surface on an analytics endpoint that handles user-supplied filter parameters.
This is the single highest-severity bug that only one tool caught. A Critical SQL injection, hidden inside a performance optimization, on a production analytics path. The other two tools reviewed the same diff and did not flag it.
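The class of bug is easy to reproduce outside Prisma. The sketch below uses Python's built-in `sqlite3` module purely as an illustration; the table, column, and function names are hypothetical and this is not Cal.com's code. It contrasts a filter spliced into the SQL text with the same filter passed as a bound parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (id INTEGER, team TEXT)")
conn.executemany(
    "INSERT INTO bookings VALUES (?, ?)", [(1, "alpha"), (2, "beta")]
)


def count_unsafe(team_filter: str) -> int:
    # Vulnerable: user input becomes part of the SQL text itself.
    query = f"SELECT COUNT(*) FROM bookings WHERE team = '{team_filter}'"
    return conn.execute(query).fetchone()[0]


def count_safe(team_filter: str) -> int:
    # Safe: the driver binds the value, so it is never parsed as SQL.
    query = "SELECT COUNT(*) FROM bookings WHERE team = ?"
    return conn.execute(query, (team_filter,)).fetchone()[0]


payload = "x' OR '1'='1"
print(count_unsafe(payload))  # matches every row: the filter was bypassed
print(count_safe(payload))    # matches no row: the payload is just a string
```

The same distinction applies to any raw-query layer: fragments that carry user-supplied values must stay parameterized all the way to execution, not be flattened into a string along the way.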
Finding 2 - Email Blacklist Bypass via Case Sensitivity (Cal.com Bug #9, High)
The new Add-Guests handler checked incoming email addresses against a blacklist, but the comparison was case-sensitive. An attacker could bypass a restriction on blocked@example.com by submitting Blocked@Example.com. The domain part of an email address is case-insensitive, and although the mail standards technically permit case-sensitive local parts, all major providers treat them case-insensitively. Both spellings therefore reach the same mailbox, which makes this a real bypass.
This kind of bug requires knowing that email blacklists must normalize before comparison - domain awareness of how the feature is used, not just what the code does syntactically.
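A minimal sketch of the difference, with hypothetical function names (this is not the Add-Guests handler itself): the buggy check compares raw strings, while the fixed check normalizes both sides before comparing.

```python
def normalize_email(address: str) -> str:
    """Trim whitespace and lowercase so equivalent addresses compare equal."""
    return address.strip().lower()


def is_blacklisted_buggy(address: str, blacklist: set[str]) -> bool:
    # Case-sensitive comparison: "Blocked@Example.com" slips through.
    return address in blacklist


def is_blacklisted(address: str, blacklist: set[str]) -> bool:
    # Normalize both the candidate and the stored entries before comparing.
    normalized = {normalize_email(entry) for entry in blacklist}
    return normalize_email(address) in normalized


blacklist = {"blocked@example.com"}
print(is_blacklisted_buggy("Blocked@Example.com", blacklist))  # False
print(is_blacklisted("Blocked@Example.com", blacklist))        # True
```

In practice the stored entries would be normalized once at write time rather than on every check; the sketch normalizes on read only to stay self-contained.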
Finding 3 - 2FA Backup Code Reuse (Cal.com Bug #2, Critical)
The new 2FA backup-code feature allowed backup codes to be used for authentication - but the codes were not invalidated after use. A stolen backup code could be reused indefinitely, defeating the purpose of the one-time-use security model.
Both MergeMitra and CodeRabbit caught this one - which is to CodeRabbit's credit. But Greptile posted four comments on the same PR and missed the most critical bug in it. On a PR that modifies the authentication path, missing the backup-code reuse vulnerability is a significant gap.
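The one-time-use property comes down to a single step: a code must be invalidated in the same operation that accepts it. The sketch below is a simplified illustration with hypothetical names; it keeps codes in an in-memory set as a stand-in for persistent storage, and a real system would typically store them hashed.

```python
class BackupCodes:
    """Holds a user's unused backup codes and enforces one-time use."""

    def __init__(self, codes):
        self._unused = set(codes)

    def redeem(self, submitted: str) -> bool:
        """Accept a valid code exactly once, then invalidate it."""
        if submitted not in self._unused:
            return False
        # The step the planted bug omitted: remove the code after use.
        self._unused.remove(submitted)
        return True


codes = BackupCodes(["abc-123", "def-456"])
print(codes.redeem("abc-123"))  # True: first use succeeds
print(codes.redeem("abc-123"))  # False: the code is no longer valid
```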
Finding 4 - Incomplete NPE Fix (Keycloak Bug #10, Medium)
The concurrent-group-access PR added a null check to getSubGroupsCount() to prevent a NullPointerException during concurrent group deletion. All three tools flagged the basic null-check issue. But MergeMitra went one step further: it pointed out that three sibling getSubGroupsStream() methods still executed the same modelSupplier.get().getSubGroupsStream(...) pattern without the null guard. The fix patched one of four vulnerable methods and left the other three exposed to the same race condition.
This is the clearest example of the difference between "catching a bug" and "reviewing like a senior engineer." All three tools saw the problem. Only one asked whether the fix was complete.
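The pattern can be shown in a few lines. Keycloak's code is Java; the sketch below is a Python analogue with hypothetical class and method names, where a supplier returning `None` stands in for a group that was deleted concurrently. The point is that every method that dereferences the supplier's result needs the guard, not only the one that happened to crash first.

```python
from typing import Callable, Iterator, Optional


class GroupModel:
    def __init__(self, subgroups):
        self._subgroups = list(subgroups)

    def sub_groups_count(self) -> int:
        return len(self._subgroups)

    def sub_groups_stream(self) -> Iterator[str]:
        return iter(self._subgroups)


class GroupAdapter:
    def __init__(self, model_supplier: Callable[[], Optional[GroupModel]]):
        # The supplier returns None once the group has been deleted.
        self._model_supplier = model_supplier

    def get_sub_groups_count(self) -> int:
        model = self._model_supplier()
        return 0 if model is None else model.sub_groups_count()

    def get_sub_groups_stream(self) -> Iterator[str]:
        # The sibling method needs the same guard as the count method above.
        model = self._model_supplier()
        return iter(()) if model is None else model.sub_groups_stream()
```

Routing every access through one guarded helper would make the fix harder to apply partially, which is the gap the review comment pointed at.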
10. Recommendation
Pick MergeMitra.
Across 20 PRs, on two radically different stacks (Java enterprise and TypeScript modern web), judged independently by Claude Code, MergeMitra is the only tool that consistently produced senior-engineer-quality reviews. It found the planted bug in 17 of the 20 PRs (85%) - versus CodeRabbit's 13 (65%) and Greptile's 12 (60%). It was the only tool to catch the Critical SQL injection on Cal.com's raw-query refactor and the only tool to catch the email-blacklist bypass on the Add-Guests feature. And it did so with effectively zero false positives.
- Pick Greptile if your only priority is maximum signal-to-noise on a small surface area, and you accept that you will miss real security bugs on big PRs.
- Pick CodeRabbit if your priority is broad coverage with polished autofix UX, and you are willing to train reviewers to triage CodeRabbit's output (criticals can be hidden in collapsed summaries; some findings are noise).
For a single-tool enterprise pick, MergeMitra is the right answer.
11. Caveats
A few things this report is not:
- 20 PRs is a strong sample, not a procurement guarantee. If your codebase is meaningfully different (e.g., heavy mobile, embedded, or data-engineering), re-run this methodology before deciding.
- All three tools evolve continuously. This evaluation reflects their behavior in April 2026. A tool that loses today may improve, and one that wins today can regress.
- No finding was executed at runtime. Findings that depend on runtime behavior (test flakiness, actual race-window timing) are marked as partial in the underlying reports.
- Code review is not the only way to catch bugs. Tests, monitoring, fuzzing and human QA all matter. An AI reviewer is one layer in a defense-in-depth strategy, not a replacement for the others.
12. References
Everything in this report is reproducible. Below are the 60 PR review threads, the source codebases, the underlying validation reports, and the prompts we used to evaluate them.
Cal.com - 30 PR review threads
Keycloak - 30 PR review threads
Source codebases
- Cal.com - https://github.com/calcom/cal.com
- Keycloak - https://github.com/keycloak/keycloak
Claude Code prompts used for the analysis
Each per-codebase report was produced from a single prompt given to Claude Code. From that one instruction, Claude Code fetched the PR data via the GitHub CLI, validated each finding against the source, and produced the verdicts.