Does RAG Actually Help AI Coding Tools?
I benchmarked Claude Code and GitHub Copilot on the same model with and without RAG-powered semantic search across 60 queries. RAG didn't make search more accurate, but it cut token consumption by 28%.
I daily drive Claude Code. It’s amazing, but I’m always baffled by the search. My workflow for adding blog posts is to drop the markdown and any JSX components into blog-triage/ at the project root, then get Claude to review them.
Somehow, this is what happens:
> please review blog-triage
Searched for 2 patterns (ctrl+o to expand)
Explore(Find blog-triage content)
Done (32 tool uses / 32.6k tokens / 38s)
(ctrl+o to expand)
Read 2 files (ctrl+o to expand)
32 tool uses and 32k tokens to read 2 files, in a path that was given more or less explicitly.
I’ve built RAG systems before for domains where agents genuinely cannot work without retrieval, things like thousand-page hardware reference manuals where you need precise register maps and clock trees, not what the model thinks the peripheral does. So the current consensus that agentic search has made RAG obsolete doesn’t sit right with me. I decided to test the question properly.
The setup
search-bench runs AI coding CLI tools against a real codebase in two modes:
- Native, the tool uses its built-in search: grep, glob, file reads. Whatever it ships with.
- RAG (Retrieval-Augmented Generation), the tool gets access to an MCP server that provides hybrid semantic search (FAISS embeddings + SQLite FTS5 full-text). Pre-indexed. Ready to go.
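The hybrid search the MCP server provides can be sketched as two ranked lists fused together. This is a minimal stdlib illustration, not the benchmark's actual code: the keyword side is real SQLite FTS5, while the dense side (which FAISS would supply from embeddings at scale) is stood in for by a hand-written ranking, and the file paths are made up. Reciprocal rank fusion is one common way to combine the two; I'm assuming it here for illustration.

```python
import sqlite3

def rrf_fuse(ranked_lists, k=60, top_n=5):
    """Reciprocal rank fusion: score(d) = sum over lists of 1/(k + rank)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Keyword side: SQLite FTS5, ordered by its built-in BM25 rank.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE chunks USING fts5(path, body)")
db.executemany("INSERT INTO chunks VALUES (?, ?)", [
    ("src/upload/pipeline.ts", "upload pipeline stages and validation"),
    ("src/auth/session.ts",    "session token authentication middleware"),
    ("src/render/kicad.ts",    "kicad schematic renderer canvas"),
])
keyword_hits = [row[0] for row in db.execute(
    "SELECT path FROM chunks WHERE chunks MATCH ? ORDER BY rank", ("upload",))]

# Dense side: in the real server this ranking comes from FAISS
# nearest-neighbour search over embeddings.
semantic_hits = ["src/upload/pipeline.ts", "src/render/kicad.ts"]

fused = rrf_fuse([keyword_hits, semantic_hits])
```

A file that both searches rank highly wins; a file found by only one search still surfaces, just lower down.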
I wrote 60 queries across four categories: exact symbol lookups (“where is function X?”), conceptual questions (“how does the upload pipeline work?”), cross-cutting concerns (“what files handle authentication?”), and refactoring analysis (“what would need to change to swap the rendering library?”). Each query has verified ground truth files.
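Scoring a run against ground truth reduces to set overlap. The sketch below shows the metric definitions I report later (recall, precision, F1); the benchmark's real scorer additionally has to parse file paths out of free-text answers, which is where the Copilot parsing trouble comes from. The example paths are hypothetical.

```python
def score(returned: set[str], truth: set[str]) -> dict:
    """Recall, precision, and F1 of the files a run returned vs ground truth."""
    tp = len(returned & truth)                      # true positives
    recall = tp / len(truth) if truth else 0.0
    precision = tp / len(returned) if returned else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}

# Hypothetical example: one correct file found, one correct file missed,
# one irrelevant file returned.
truth = {"src/upload/pipeline.ts", "src/upload/validate.ts"}
returned = {"src/upload/pipeline.ts", "src/render/kicad.ts"}
result = score(returned, truth)
```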
The target codebase is kicad-library, a ~200-file TypeScript project and the source code to circuitsnips.com, my KiCad subcircuit sharing website. Small enough to be tractable, large enough to have real architectural concerns.
Crucially, both tools ran the same model: Haiku 4.5. Claude Code supports explicit model selection via --model, and GitHub Copilot CLI also runs Haiku 4.5. This eliminates the model capability confound entirely. Any difference in the numbers is due to the tool’s search architecture, not the underlying model.
Each tool ran every query 3 times in both modes. Native and RAG phases ran separately: all native queries completed first with no MCP server configured, then the MCP server was started and all RAG queries ran. Claude Code completed all 360 runs; Copilot completed 325 (I maxed out my Copilot Pro subscription, and have 2 days before it resets). That gives ~685 data points across the two tools.
I also ran Codex CLI and Gemini CLI, but neither produced reliable data. Codex returned empty answers beyond exact symbol lookups and Gemini hit persistent auth failures in subprocess mode. Since this benchmark compares search modes, not tools, I discarded both rather than present incomplete data. (Getting their harnesses working is a possible future improvement.)
Three caveats before you look at the data
The Copilot parser is imperfect. My file extraction regex struggled with Copilot’s output format. Copilot often returned detailed, correct answers that the scoring couldn’t parse, so its true recall is likely higher than reported. This hits some categories harder than others (exact symbol lookups show 0.200 recall despite Copilot clearly finding the right files in the raw output). The Copilot absolute numbers are best treated as lower bounds. The relative change between native and RAG within Copilot is more reliable, because parsing errors affect both modes roughly equally. Low extraction confidence results were excluded.
The MCP server is warm, not cold. In this benchmark, the MCP server starts once before the RAG phase and stays running for all queries. This is closer to how you’d actually deploy it (a background process with the index already loaded). It does mean the RAG speed numbers don’t include cold-start overhead, which would add ~12 seconds per query if the server restarted each time.
200 files is a small codebase. Native grep on a 200-file TypeScript project takes milliseconds. RAG is designed for scale, for 10,000+ file monorepos where context windows overflow and keyword search starts failing. This benchmark answers: “does RAG matter for small-to-medium codebases?” It does not answer what happens at scale, where the balance almost certainly shifts in RAG’s favour. I’d like to test that, but haven’t yet.
One more note: Copilot’s IDE integration likely uses workspace indexing for search, but when invoked as a CLI subprocess (as in this benchmark), it falls back to basic tool calls with no pre-built index, to the best of my understanding. The native-mode results here may not reflect Copilot’s full search capability in VS Code.
The numbers
Claude Code
Haiku 4.5 · 360 runs
GitHub Copilot
Haiku 4.5 · 325 runs
Benchmark: 60 queries across 4 categories (exact symbol, conceptual, cross-cutting, refactoring) against a ~200-file TypeScript codebase. Both tools on Haiku 4.5. RAG via warm MCP server (FAISS + SQLite FTS5). Per-tool semaphore; sequential execution within each tool. Native and RAG phases run separately. Copilot absolute numbers are lower bounds due to parser limitations (see caveats).
The headline: tool design dominates
Both tools are running the same model. Both have access to the same codebase. Both get the same 60 queries. The difference is how they search.
Claude Code native achieves 0.907 recall in 37 seconds. Copilot native achieves 0.604 recall in 61 seconds. That’s a 30-percentage-point recall gap and a 1.6x speed gap, on the same model. Adding RAG doesn’t close it: Claude Code native (0.907) still outperforms Copilot RAG (0.617) by 29 points.
This isn’t a model story. It’s a tool story. Claude Code’s agentic search, how it decides what to grep for, how it reads results, how it iterates, is fundamentally more effective than Copilot’s. The same Haiku 4.5 model, given better search tooling, finds 50% more relevant files in 40% less time.
What RAG does to each tool
Claude Code: breakeven accuracy, 28% fewer tokens.
| Metric | Native | RAG | Delta |
|---|---|---|---|
| Recall | 0.907 | 0.924 | +1.7pp |
| Precision | 0.409 | 0.437 | +2.8pp |
| F1 | 0.533 | 0.559 | +4.9% |
| Speed | 37.0s | 34.6s | -6% |
| Tokens/query | 367K | 265K | -28% |
The accuracy story is a wash. Claude’s native search already finds 90.7% of the right files. RAG pushes that to 92.4%, but the statistical comparison shows p=0.911, well above any conventional significance threshold. The difference is indistinguishable from noise.
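A p-value like this typically comes from comparing the per-query score distributions. Below is a stdlib permutation test of the kind that could back the p=0.911 claim; the benchmark may use a different test, and the per-query recall values shown are hypothetical, not my measured data.

```python
import random

def permutation_test(a, b, n_iter=10_000, seed=0):
    """Two-sided p-value for a difference in means under label shuffling."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            hits += 1
    return hits / n_iter

# Hypothetical per-query recall samples for the two modes.
native = [0.9, 1.0, 0.8, 1.0, 0.9]
rag    = [1.0, 0.9, 0.9, 1.0, 0.9]
p = permutation_test(native, rag)
```

A p-value near 1 means the shuffled differences exceed the observed one almost every time, i.e. the observed gap is what you'd expect from noise alone.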
The token story is not a wash. Native queries consume 367K tokens on average. RAG queries consume 265K. That’s a 28% reduction in token consumption at effectively identical accuracy. RAG provides richer context in fewer round trips, so Claude makes fewer agentic iterations and reads fewer files to reach the same answer.
If you’re on a subscription, this means faster responses and less context window pressure. If you’re on the API, this is a straight cost reduction. At Anthropic API pricing, that’s big money at volume.
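To make "big money at volume" concrete, here is back-of-envelope arithmetic using the measured per-query token counts. The price and query volume are explicitly hypothetical placeholders; substitute current Anthropic API pricing and your own usage.

```python
# Token savings from the benchmark: 367K native vs 265K RAG per query.
NATIVE_TOKENS = 367_000
RAG_TOKENS = 265_000

PRICE_PER_M = 1.00        # hypothetical $/million tokens; check real pricing
QUERIES_PER_DAY = 200     # hypothetical: a small team's daily query volume

saved_tokens = NATIVE_TOKENS - RAG_TOKENS
daily_saving = saved_tokens / 1e6 * PRICE_PER_M * QUERIES_PER_DAY
yearly_saving = daily_saving * 365
```

Even at these modest placeholder numbers the saving compounds into thousands of dollars a year; larger teams and blended input/output pricing push it higher.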
GitHub Copilot: transformative speed improvement, modest accuracy lift.
| Metric | Native | RAG | Delta |
|---|---|---|---|
| Recall | 0.604 | 0.617 | +1.3pp |
| Precision | 0.256 | 0.333 | +7.7pp |
| F1 | 0.338 | 0.404 | +19.5% |
| Speed | 60.8s | 34.1s | -44% |
RAG cuts Copilot’s response time nearly in half. In native mode, Copilot averages 13 tool calls per query, each requiring a model inference round trip. RAG gives it richer context upfront, reducing the number of iterations needed. The precision jump from 0.256 to 0.333 shows RAG is helping Copilot focus: fewer files returned, but more of them are right.
If you’re using Copilot CLI today, bolting on a semantic search MCP server might be the single highest-leverage improvement you can make.
I was unable to get meaningful token data from Copilot during the benchmark run; I may re-run it in future to fill that gap.
Where RAG earns its keep
The accuracy lift isn’t uniform across query types.
Exact symbol lookups (“where is parseSExpression defined?”): RAG pushes Claude from 0.933 to perfect 1.000 recall. These queries are already easy for grep, but RAG catches the occasional edge case. Copilot’s low numbers here (0.200) are primarily a parsing artefact.
Conceptual questions (“how does the upload pipeline work?”): Claude goes from 0.896 to 0.933. RAG helps by surfacing semantically related files that don’t share keywords with the query. These are broad questions where multiple search strategies converge on similar results, so the lift is real but not dramatic.
Cross-cutting concerns (“what files handle rate limiting?”): Near-identical recall for both tools in both modes. Claude: 0.891 native vs 0.881 RAG. Copilot: 0.867 native vs 0.862 RAG. Neither tool benefits from RAG here, suggesting both native and semantic search are equally effective at finding cross-cutting files in a 200-file codebase.
Refactoring analysis (“what would change if we replaced the KiCad renderer?”): Copilot jumps from 0.839 to 0.932, an 11% improvement. These queries require understanding functional relationships between files, the kind of thing semantic similarity is designed for. A file called ThumbnailRegenerator.tsx doesn’t mention “renderer” in its name, but it’s semantically related. RAG finds it. Grep doesn’t. Claude stays flat at 0.907/0.880, because its native search is already picking up these relationships through iterative exploration.
The pattern: RAG helps most when the answer isn’t in the filename or the imports, and when the tool’s native search isn’t already doing multi-step reasoning to find it.
The cost case for RAG
The token data is the finding that surprised me most. Claude Code’s native search consumes 367K tokens per query on average. It’s spending those tokens strategically, grepping, reading results, narrowing down, reading more, and it converges on the right files through intelligent iteration. But it’s still burning through tokens to get there.
With RAG, Claude drops to 265K tokens per query because the semantic search front-loads relevant context. The model doesn’t need to iterate as much. Same accuracy, 28% less compute.
For subscription users, this is the difference between a slightly-faster and a slightly-slower response. But for API users, and for anyone thinking about deploying AI coding tools at team or organisation scale, a 28% token reduction at breakeven accuracy is significant. Anthropic can subsidise inference on Max subscriptions because they’re building market share. That subsidy doesn’t extend to the API, and it won’t last forever. RAG might not make your tool smarter, but it can make it substantially cheaper to run.
Copilot’s native search, by contrast, takes 13 tool calls on average and 61 seconds. It’s making more round trips but getting less value from each one. RAG compresses those iterations by giving Copilot better starting context, which is why the speed improvement is so dramatic (44%) while the recall improvement is modest (1.3pp). The model can reason about the code equally well; it’s the search scaffolding that makes the difference.
Claude Code already tried this
Here’s something worth knowing: early versions of Claude Code actually used RAG with a local vector database. They moved away from it. Boris Cherny, who works on Claude Code at Anthropic, explained the reasoning:
Agentic search generally works better. It is also simpler and doesn’t have the same issues around security, privacy, staleness, and reliability.
My data is consistent with that decision, if you’re optimising for accuracy. With the same Haiku 4.5 model, Claude Code’s native agentic search achieves 0.907 recall, and RAG adds only 1.7 percentage points. The engineering complexity of maintaining a vector index, keeping it fresh, handling edge cases around file deletions and renames, isn’t obviously worth that small a lift.
But the cost picture is different. RAG reduces token consumption by 28% at breakeven accuracy. If you’re Anthropic and you’re subsidising inference to win market share, that’s an internal infrastructure saving. If you’re an enterprise deploying Claude Code across 500 engineers, that’s a line item on a budget. The accuracy case for RAG is weak. The cost case is strong.
But that calculus doesn’t apply equally to all tools.
The case for RAG in Copilot
If your tool’s native search is less sophisticated, if it makes more round trips, takes longer to converge, and doesn’t extract as much value from each iteration, then RAG changes the equation. Copilot CLI goes from 61 seconds to 34 seconds with RAG. That’s not a marginal improvement; it transforms the experience from “go make coffee” to “instant enough to stay in flow.”
The F1 improvement of 19.5% is also meaningful. Copilot native returns an average of 11 files per query with 0.256 precision. Copilot RAG returns 4.6 files with 0.333 precision. RAG is acting as a focus mechanism, giving the model better initial context so it doesn’t need to cast as wide a net.
If you’re using Copilot CLI, add a semantic search MCP server. The setup is straightforward (FAISS + SQLite FTS5, index your codebase, expose over stdio) and the payoff is immediate.
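The keyword half of such a server is genuinely small. This sketch indexes a TypeScript tree into an FTS5 table and exposes a BM25-ranked search function; it's illustrative, not the benchmark's server, and in a real deployment the semantic half (FAISS over embeddings) would be fused with these results and the whole thing wrapped in an MCP stdio transport.

```python
import pathlib
import sqlite3

def index_codebase(root: str, db_path: str = ":memory:") -> sqlite3.Connection:
    """Walk a TypeScript tree and load each file into an FTS5 table."""
    db = sqlite3.connect(db_path)
    db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS files USING fts5(path, body)")
    rows = [(str(p), p.read_text(errors="ignore"))
            for p in pathlib.Path(root).rglob("*.ts")]
    db.executemany("INSERT INTO files VALUES (?, ?)", rows)
    db.commit()
    return db

def search(db: sqlite3.Connection, query: str, limit: int = 10) -> list[str]:
    """BM25-ranked keyword search over the indexed files."""
    return [row[0] for row in db.execute(
        "SELECT path FROM files WHERE files MATCH ? ORDER BY rank LIMIT ?",
        (query, limit))]
```

SQLite's bundled FTS5 handles the ranking for free, which is why the "index your codebase" step costs so little engineering relative to the speed win.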
What’s next
The biggest limitation of this benchmark is that RAG is bolted on as an MCP server alongside the tool’s existing search. That means every RAG query pays MCP protocol overhead on top of the actual retrieval, and the tool still has its native search available, so you’re measuring “native + optional RAG” rather than “RAG as the primary search strategy.”
To fix that, I’m planning to fork opencode, an open-source coding CLI, and replace the core search implementation directly. One build with standard agentic search (grep, glob, file reads, Claude Code style). One with RAG as the primary search backend. Same model, same tool, no MCP overhead. That should isolate whether RAG actually retrieves better results, or whether the modest lift I’m seeing here is just the MCP server adding a second opinion that occasionally catches something grep missed.
I’d also like to run this on larger codebases. The hypothesis is that RAG’s value increases with codebase size, and a 200-file project may be below the threshold where native search starts failing.
Finally, I reused the RAG approach I originally built for indexing PDF reference manuals. I haven’t evaluated different embedding models to see whether any perform better in this application; FAISS would stay, as would FTS5, but there’s likely optimisation to be had in the embedding choice.
The bottom line
When you control for model capability, the gap between tools is enormous and the gap between search modes is small.
Claude Code’s agentic search achieves 0.907 recall in 37 seconds. The same model through Copilot’s search achieves 0.604 recall in 61 seconds. That’s a 50% recall advantage and 40% speed advantage from tool design alone, not model capability.
RAG helps both tools, but differently. For Claude Code, it’s breakeven accuracy (+1.7pp recall, not statistically significant) but a 28% reduction in token consumption. That’s not an accuracy play, it’s a cost play, and at API pricing or team scale it’s a meaningful one. For Copilot, it’s a transformative speed improvement (-44%) and a meaningful F1 lift (+19.5%), because RAG compresses the iterative search process that Copilot does less efficiently.
The honest answer for a 200-file codebase: good search tooling matters more than retrieval strategy. RAG earns its keep on the hard queries and as a speed optimiser for tools with expensive search loops, but it can’t compensate for fundamentally less effective search architecture. If you’re choosing between investing in better agentic search or bolting on RAG, the data says invest in the search.
Whether that changes at scale, or when RAG replaces the search rather than supplementing it, is the next question. I’ll update this when I have the data.
*The benchmark code, results, and interactive dashboard are open source at search-bench.*