
Why your RAG should be a library, not a search engine
Executive Summary
Observation: Similarity search (top-k) works in demos but breaks in production as soon as questions are ambiguous, multi-hop, global, or when you need to arbitrate contradictions.
Core idea: A strong RAG should behave like a library (index, categories, navigation, levels of abstraction), not a flat search engine based solely on semantic proximity.
Implication: You must structure the corpus (metadata, hierarchies, relationships) and use strategies adapted to the question type, with progressive retrieval to control cost and quality.
Options: Hierarchical indexing (levels), RAPTOR (summary tree), GraphRAG (knowledge graph + communities), Agentic RAG (librarian agent/router).
Glossary
- RAG (Retrieval-Augmented Generation): approach where an LLM generates answers using context retrieved from a corpus.
- Chunk: a document fragment (often a few paragraphs) indexed for retrieval and injected into the LLM context.
- Embedding: vector representation of text used to measure semantic proximity.
- Vector store: database optimized to store embeddings and run similarity / k-NN searches.
- Similarity search / top-k: retrieving the k “closest” items in embedding space.
- Precision / recall: retrieval metrics; precision = fraction of retrieved items that are relevant, recall = fraction of relevant items that are retrieved.
- Multi-hop: a question that requires chaining multiple facts/relations (e.g., entity → organization → attribute).
- Reranking: re-sorting candidates (often with a cross-encoder) after an initial retrieval to improve precision.
- Hybrid search: combining lexical search (BM25/keywords) and vector search.
- Metadata filtering: constraining retrieval by attributes (date, source, domain, type) before or during search.
- Hierarchical indexing: organizing the corpus into levels (summary → sections → details) to navigate from general to specific.
- RAPTOR: method that builds a tree of summaries via recursive clustering, providing multiple abstraction levels.
- Knowledge graph: graph of entities and relationships enabling explicit traversals (instead of similarity approximations).
- GraphRAG: approach that combines entity/relation extraction, community clustering, and summaries to answer global and multi-hop questions.
- Agentic RAG: an agent that dynamically chooses the best retrieval strategy (filters, search, graph traversal, etc.) based on the question.
- Progressive retrieval: retrieving a summary/card first, then drilling down into details only if needed.
The problem: similarity search is not enough
You have a corpus, embeddings, a vector store, a top-k of 5. It works in demos. In production, it breaks.
The issue is not implementation — it’s the paradigm. Similarity search relies on an implicit assumption: the most semantically similar chunk is the most relevant. That’s often false.
Five recurring failures in production
| Problem | What happens | Example |
|---|---|---|
| Low precision | The top-k returns semantically close chunks that aren’t relevant — noise, ambiguity, superficial false positives | “vestibular treatment” matches “water treatment” because “treatment” dominates the vector |
| Multi-hop impossible | Questions that require chaining multiple facts consistently fail | “What degree does the CEO of the company that makes the F-150 have?” — no single chunk contains the whole answer |
| No aggregation | Impossible to answer corpus-wide questions | “What are the main themes?”, “How many publications talk about GVS?” |
| Unresolved conflicts | Two contradictory chunks, no mechanism to decide | CEO of Twitter in 2022 vs 2023 — which is “more similar”? Both. |
| Embeddings ≠ meaning | Vectors capture semantic proximity, not business logic | Entity relations, temporality, domain hierarchy — none of that is in an embedding |
Common thread: we ask a similarity tool to do a structural understanding job. It’s like asking a spellchecker to validate an argument’s logic.
Incremental improvements: reranking and hybrid search
Two techniques have become reflexes to improve retrieval. They help — but they don’t change the paradigm.
Reranking adds a cross-encoder after top-k to better sort results. Precision improves, sometimes significantly. But the fundamental issue remains: if the right chunk isn’t in the initial candidate set, no reranker will make it appear. You optimize ordering, not coverage.
Hybrid search combines lexical search (BM25/keywords) and vector search. The gain is real: roughly a 20% recall improvement in common benchmarks. But the paradigm stays the same: flat search → sort → hope the right chunk is in the set. It’s a quick win, not an architectural shift.
Both techniques are useful. They should be the minimum. But treating them as the final solution is like putting a better engine in a car whose problem is the steering.
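To make that baseline concrete, here is a minimal, self-contained sketch of the hybrid-then-rerank pipeline. The scoring functions are toy stand-ins (word overlap for BM25, character-trigram overlap for embedding similarity and for the cross-encoder), not any specific library’s API; in practice you would swap in real components.

```python
# Toy hybrid search + reranking. Scoring functions are placeholders, not real models.

def lexical_score(query: str, doc: str) -> float:
    """Stand-in for BM25: fraction of query terms present in the document."""
    q_terms, d_terms = set(query.lower().split()), set(doc.lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1)

def semantic_score(query: str, doc: str) -> float:
    """Stand-in for embedding similarity: character-trigram overlap."""
    grams = lambda s: {s[i:i + 3] for i in range(len(s) - 2)}
    q, d = grams(query.lower()), grams(doc.lower())
    return len(q & d) / max(len(q | d), 1)

def hybrid_retrieve(query: str, corpus: list[str], k: int = 20, alpha: float = 0.5) -> list[str]:
    """Blend lexical and semantic scores, keep the top-k candidates."""
    key = lambda doc: alpha * lexical_score(query, doc) + (1 - alpha) * semantic_score(query, doc)
    return sorted(corpus, key=key, reverse=True)[:k]

def rerank(query: str, candidates: list[str], k: int = 5) -> list[str]:
    """Re-sort candidates (toy score standing in for a cross-encoder).
    If the right chunk is missing from `candidates`, no reranker can add it:
    ordering improves, coverage does not."""
    return sorted(candidates, key=lambda d: semantic_score(query, d), reverse=True)[:k]
```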
The analogy: library vs search engine
To understand the required shift, take a simple analogy.
A search engine is: you type words, you get “close” results. No structure, no navigation, no context. It’s statistical matching.
A library is: an index, categories, shelves, cards, classification codes, cross-references. The librarian doesn’t search by similarity — they navigate an organized structure. They know galvanic vestibular stimulation belongs in neuroscience, under neurophysiology, and that recent publications are sorted by date.
Today’s RAG is someone walking into a library, ignoring shelves and index, and flipping books at random looking for sentences that “look like” the question.
The paradigm shift is one sentence: move from “find the closest” to “navigate to the most relevant.”
Data is not a pile of vectors — it’s a corpus you can organize. Structure costs time to build. But it makes retrieval reliable, explainable, and auditable.
What do we actually need?
Before choosing a tool or a framework, you must state the needs. Six capabilities define robust retrieval.
| # | Need | What it enables | What flat search can’t do |
|---|---|---|---|
| 1 | Navigation across abstraction levels | Zoom from macro (themes) to meso (summaries) to micro (chunks) | Top-k knows only one granularity level |
| 2 | Metadata filtering | Constrain search by date, source, domain, type, authority | “Most similar” ≠ “most similar within this category, after this date” |
| 3 | Entity relationships | Traverse links: author → institution → publications → themes | Every multi-hop link would need an embedding match — fragile and often impossible |
| 4 | Corpus-wide questions | “What are the main themes?”, “How many publications on X?” | Flat search is structurally unable to aggregate |
| 5 | Adaptive strategy | Retrieval method adapts to question type | Factual ≠ synthesis ≠ exploration — one pipeline won’t cover all |
| 6 | Progressive retrieval | Return a summary first, then details if relevant | Sending 15 full chunks saturates context, costs money, and drowns signal |
Need #6 deserves emphasis. In production, per-query cost and answer quality depend directly on the number of tokens sent to the LLM. Progressive retrieval — fetch a card first, evaluate relevance, then fetch details — is not a nice-to-have. It’s what makes the system viable. The librarian doesn’t hand you 15 full books. They hand you a card. You decide whether you want the book.
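A minimal sketch of that card-then-details flow, assuming hypothetical inputs: `cards` (one lightweight summary per document), `chunks_by_doc` (full chunks keyed by document), and `is_relevant` (any cheap relevance check, such as a similarity threshold or a small LLM call).

```python
from dataclasses import dataclass

@dataclass
class Card:
    doc_id: str
    summary: str   # a few sentences: cheap to score and cheap to send to the LLM

def progressive_retrieve(query, cards, chunks_by_doc, is_relevant, max_docs=3):
    """Step 1: evaluate lightweight cards. Step 2: drill down only where it pays off."""
    selected = [card for card in cards if is_relevant(query, card.summary)]
    context = []
    for card in selected[:max_docs]:
        context.append(card.summary)                 # macro view, always included
        context.extend(chunks_by_doc[card.doc_id])   # micro view, only for relevant docs
    return context
```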
Running example: a scientific corpus
To make this concrete, use a realistic scenario: a corpus of 500 scientific publications about galvanic vestibular stimulation (GVS) covering neurophysiology, clinical applications, and virtual reality.
Here are questions researchers actually ask — and why flat search fails on each:
| User question | Primary need | Why flat search fails |
|---|---|---|
| “What are the major research themes in GVS?” | #4 Global aggregation | No single chunk contains “the main themes” |
| “Summarize Fitzpatrick lab’s work on postural GVS” | #3 Relationships + #1 Abstraction | You must link author → lab → publications → summary |
| “Which post-2022 publications cover GVS in VR?” | #2 Metadata + similarity | Without a time filter, top-k mixes 2005 and 2023 |
| “Compare sinusoidal vs noise stimulation protocols” | #5 Adaptive strategy | Comparison ≠ factual lookup — you need structured sources on both sides |
| “Give me an overview of this paper, then details from the methods section” | #6 Progressive retrieval | Sending the whole paper wastes tokens; retrieving only “methods” chunks misses context |
Each solution in the next section is illustrated by its ability to answer one of these questions.
Documented solutions
Four approaches, each targeting different needs. None is universal — the right choice depends on the real complexity of your queries.
Hierarchical indexing — level-based index
The principle is simple: organize the corpus into levels, like a table of contents. Document summaries → sections → detailed chunks. Retrieval navigates from general to specific.
In our running example, when a researcher asks for an overview and then method details, the system returns the document summary first (macro), evaluates relevance, then drills down to the “methods” chunk (micro). No token waste, no noise.
Needs covered: #1 Abstraction navigation, #6 Progressive retrieval.
The trade-off is blunt: the hierarchy must be built upfront, and it must match the real structure of the content. A bad hierarchy can perform worse than flat search.
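As a sketch, a level-based index can be as small as the data model below. The `score(query, text)` relevance function is assumed to be supplied by the caller (embedding similarity, for instance); the structure itself is illustrative, not a particular framework’s API.

```python
from dataclasses import dataclass, field

@dataclass
class Section:
    title: str
    summary: str
    chunks: list[str] = field(default_factory=list)

@dataclass
class Document:
    title: str
    summary: str
    sections: list[Section] = field(default_factory=list)

def hierarchical_retrieve(query, docs, score, top_docs=2, top_sections=1, top_chunks=3):
    """Navigate from the general (document summaries) to the specific (chunks)."""
    best_docs = sorted(docs, key=lambda d: score(query, d.summary), reverse=True)[:top_docs]
    results = []
    for doc in best_docs:
        best_sections = sorted(doc.sections, key=lambda s: score(query, s.summary),
                               reverse=True)[:top_sections]
        for sec in best_sections:
            results.extend(sorted(sec.chunks, key=lambda c: score(query, c),
                                  reverse=True)[:top_chunks])
    return results
```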
RAPTOR — recursive summary tree
RAPTOR pushes the idea further. Instead of an imposed hierarchy, it builds a bottom-up tree: chunks are clustered, each cluster is summarized by an LLM, summaries are clustered and summarized again, and so on. The result is a navigable tree where each node provides a different abstraction level.
On our GVS corpus, “What are the major research themes?” is answered by upper nodes — where summaries capture macro trends without requiring any individual chunk to contain that information.
Needs covered: #1 Abstraction navigation, #4 Corpus-wide questions, #6 Progressive retrieval.
The cost is significant: clustering + LLM summarization at each level. Quality depends directly on summary quality. A bad intermediate summary contaminates everything above it.
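A sketch of the bottom-up construction, with `cluster` and `summarize` as assumed callables (in the paper, clustering uses Gaussian mixture models over dimension-reduced embeddings and summaries come from an LLM; here both are simply passed in).

```python
def build_summary_tree(chunks, cluster, summarize, max_levels=3):
    """Bottom-up: cluster the current level, summarize each cluster, repeat."""
    levels = [list(chunks)]             # level 0: raw chunks
    nodes = list(chunks)
    for _ in range(max_levels):
        if len(nodes) <= 1:
            break                       # reached a single root summary
        groups = cluster(nodes)         # assumed: list[str] -> list[list[str]]
        nodes = [summarize("\n\n".join(group)) for group in groups]  # assumed LLM call
        levels.append(nodes)            # each level is one abstraction step higher
    return levels                       # corpus-wide questions query the upper levels
```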
GraphRAG — knowledge graph + communities
GraphRAG changes the representation. Instead of treating the corpus as a set of chunks, it extracts entities and relationships to build a knowledge graph, then applies hierarchical community detection (the Leiden algorithm) to identify communities of entities and generates a summary per community.
This is the approach that best answers multi-hop questions. “Summarize Fitzpatrick lab’s work on postural GVS” requires a graph traversal: Fitzpatrick → University of New South Wales → publications → filter by postural GVS. No vector search can reliably do that path — graph traversal does it natively.
Needs covered: #3 Entity relationships, #4 Corpus-wide questions, #1 Abstraction navigation.
The trade-off is heavy: high indexing cost (LLM-based entity extraction over the full corpus), maintenance complexity, and — often ignored — GraphRAG can underperform vanilla RAG on simple factual questions. The overhead is justified only if queries are truly complex.
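To see why the traversal becomes trivial once relationships are explicit, here is a toy graph and a hop-by-hop walk. The graph content is illustrative only; a real GraphRAG pipeline would build it with LLM-based extraction and add community summaries on top.

```python
# A toy knowledge graph as (entity, relation) -> targets. Content is illustrative.
graph = {
    ("Fitzpatrick", "affiliated_with"): ["University of New South Wales"],
    ("University of New South Wales", "published"): ["pub_12", "pub_48"],
    ("pub_12", "topic"): ["postural GVS"],
    ("pub_48", "topic"): ["vestibular imaging"],
}

def traverse(start: str, relations: list[str]) -> list[str]:
    """Follow a chain of relations hop by hop (author -> institution -> publications)."""
    frontier = [start]
    for rel in relations:
        frontier = [target for node in frontier for target in graph.get((node, rel), [])]
    return frontier

publications = traverse("Fitzpatrick", ["affiliated_with", "published"])
postural = [p for p in publications if "postural GVS" in graph.get((p, "topic"), [])]
print(postural)   # -> ['pub_12']
```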
Agentic RAG — the librarian agent
Agentic RAG doesn’t propose a new data structure — it adds a decision layer. An agent analyzes the question, chooses the optimal retrieval strategy, and orchestrates available tools: vector search, metadata filters, SQL, graph traversal, or combinations.
That’s the librarian. For “Which post-2022 publications cover GVS in VR?”, it doesn’t run raw vector search — it filters by time (date > 2022), then by topic (domain = VR), then runs semantic search within the filtered subset.
Needs covered: #5 Adaptive strategy — and potentially all others through orchestration.
Trade-off: implementation complexity, multi-step latency, and a critical dependency on routing quality. If the agent chooses poorly, results can be worse than a simple top-k.
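A sketch of the routing layer: the keyword rules below stand in for what would usually be an LLM-based classifier, and the strategy names are hypothetical labels for the tools described above.

```python
def route(question: str) -> str:
    """Pick a retrieval strategy before any search runs."""
    q = question.lower()
    if "main themes" in q or q.startswith("how many"):
        return "corpus_aggregation"            # e.g. RAPTOR upper levels / community summaries
    if "post-" in q or "since" in q or "after" in q:
        return "metadata_filter_then_vector"   # filter by date/domain, then semantic search
    if " of the " in q:
        return "graph_traversal"               # crude multi-hop hint
    return "hybrid_search"                     # default: lexical + vector

def answer(question: str, tools: dict) -> list[str]:
    """`tools` maps a strategy name to a callable(question) -> context chunks."""
    return tools[route(question)](question)
```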
Comparison of solutions
| Solution | Needs covered | Indexing cost | Complexity | Best use case |
|---|---|---|---|---|
| Hierarchical indexing | #1, #6 | Medium | Low | Well-structured corpus, factual questions needing progressive zoom |
| RAPTOR | #1, #4, #6 | High (LLM) | Medium | Unstructured corpus, multi-level summaries and corpus-wide questions |
| GraphRAG | #1, #3, #4 | Very high (LLM) | High | Multi-hop queries, entity relations, dense technical/narrative corpora |
| Agentic RAG | #5 (+ all via orchestration) | Variable | High | Heterogeneous queries requiring heterogeneous strategies |
In practice: custom is often the best choice
Frameworks cover generic patterns
RAPTOR, GraphRAG, and LlamaIndex offer ready-to-use architectures. They are well documented, tested, and a good starting point. But every domain has its own knowledge structure: the hierarchy of a medical corpus is nothing like that of a regulatory database or of customer-support content.
Decision synthesis
How do you choose between these approaches? Here is a decision matrix based on the nature of your corpus and your queries.
| Your situation | Recommended architecture | Why |
|---|---|---|
| Documentary corpus with clear structure (reports, regulations) | Hierarchical indexing + metadata filtering | Natural fit for an existing table of contents |
| Dense scientific corpus, frequent synthesis questions | RAPTOR | Can answer “What do we know about X?” without manual navigation |
| Highly relational data (entities, collaborations, causalities) | GraphRAG | Traversing Author–Protocol–Result relationships is essential |
| Unpredictable heterogeneous queries (sometimes SQL, sometimes semantic) | Agentic RAG | Maximum flexibility, at the cost of latency |
| Small volume (<1000 docs), simple queries | Hybrid search + reranking | Don’t over-engineer; complexity must match the need |
Build your own layer
The real work isn’t choosing a tool — it’s designing the index. Understand the domain structure, identify key relationships, choose meaningful abstraction levels. A tailored structuring layer often outperforms a plug-and-play framework applied as-is.
Custom’s advantage: full control over retrieval cost, granularity, and navigation logic. Drawback: you must know what you’re doing and be ready to invest design time.
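As a starting point, a custom layer can be as simple as an explicit record schema that names the metadata, relationships, and abstraction levels your retrieval logic will rely on. The fields below are illustrative for the GVS running example, not a general standard.

```python
from dataclasses import dataclass, field

@dataclass
class PublicationRecord:
    doc_id: str
    title: str
    year: int                                  # enables "post-2022" filters
    domain: str                                # e.g. "neurophysiology", "clinical", "VR"
    authors: list[str] = field(default_factory=list)
    institution: str = ""                      # enables author -> lab traversals
    summary: str = ""                          # macro level, returned first
    section_summaries: dict[str, str] = field(default_factory=dict)  # meso level
    chunk_ids: list[str] = field(default_factory=list)               # micro level
```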
Conclusion
What changes
Moving from “embed everything and top-k” to “structure, index, categorize, and let an agent navigate” is not an optimization — it’s a paradigm shift. Structured data is not overhead. It’s an investment. Progressive retrieval is not a nice-to-have. It’s what makes the system viable in production.
The question to ask
How would a human expert search in this corpus?
If the answer is “they flip things at random and take what looks similar” — you have a problem. If the answer is “they consult the index, identify the category, read the summary, then drill down” — build that.
Sources
| Reference | Type | URL |
|---|---|---|
| Seven Failure Points When Engineering a RAG System (2024) | Paper | arxiv.org |
| RAPTOR — Sarthi et al. (Stanford, 2024) | Paper | arxiv.org |
| GraphRAG — Edge et al. (Microsoft, 2024) | Paper | arxiv.org |
| IBM — RAG Problems Persist | Article | ibm.com |
| RAG Is a Data Engineering Problem | Article | substack.com |
| VectorHub — Hybrid Search & Reranking | Article | superlinked.com |
| 5 RAG Failures + Knowledge Graphs | Article | freecodecamp.org |
| PIXION — Hierarchical Index Retrieval | Article | pixion.co |
| NirDiamant/RAG_Techniques | Repo | github.com |
| Beyond Vector Search — Next-Gen RAG | Article | machinelearningmastery.com |
| LlamaIndex — Structured Hierarchical Retrieval | Doc | llamaindex.ai |
| Microsoft GraphRAG | Repo | github.com |
| RAPTOR | Repo | github.com |