
Why your RAG should be a library, not a search engine
Executive Summary
Observation: Similarity search (top-k) works in demos but breaks in production as soon as questions are ambiguous, multi-hop, global, or when you need to arbitrate contradictions.
Core idea: A strong RAG should behave like a library (index, categories, navigation, levels of abstraction), not a flat search engine based solely on semantic proximity.
Implication: You must structure the corpus (metadata, hierarchies, relationships) and use strategies adapted to the question type, with progressive retrieval to control cost and quality.
Options: Hierarchical indexing (levels), RAPTOR (summary tree), GraphRAG (knowledge graph + communities), Agentic RAG (librarian agent/router).
Glossary
- RAG (Retrieval-Augmented Generation): approach where an LLM generates answers using context retrieved from a corpus.
- Chunk: a document fragment (often a few paragraphs) indexed for retrieval and injected into the LLM context.
- Embedding: vector representation of text used to measure semantic proximity.
- Vector store: database optimized to store embeddings and run similarity / k-NN searches.
- Similarity search / top-k: retrieving the k “closest” items in embedding space.
- Precision / recall: retrieval metrics; precision = fraction of retrieved items that are relevant, recall = fraction of relevant items that are retrieved.
- Multi-hop: a question that requires chaining multiple facts/relations (e.g., entity → organization → attribute).
- Reranking: re-sorting candidates (often with a cross-encoder) after an initial retrieval to improve precision.
- Hybrid search: combining lexical search (BM25/keywords) and vector search.
- Metadata filtering: constraining retrieval by attributes (date, source, domain, type) before or during search.
- Hierarchical indexing: organizing the corpus into levels (summary → sections → details) to navigate from general to specific.
- RAPTOR: method that builds a tree of summaries via recursive clustering, providing multiple abstraction levels.
- Knowledge graph: graph of entities and relationships enabling explicit traversals (instead of similarity approximations).
- GraphRAG: approach that combines entity/relation extraction, community clustering, and summaries to answer global and multi-hop questions.
- Agentic RAG: an agent that dynamically chooses the best retrieval strategy (filters, search, graph traversal, etc.) based on the question.
- Progressive retrieval: retrieving a summary/card first, then drilling down into details only if needed.
The problem: similarity search is not enough
You have a corpus, embeddings, a vector store, a top-k of 5. It works in demos. In production, it breaks.
The issue is not implementation — it’s the paradigm. Similarity search relies on an implicit assumption: the most semantically similar chunk is the most relevant. That’s often false.
Five recurring failures in production
| Problem | What happens | Example |
|---|---|---|
| Low precision | The top-k returns semantically close chunks that aren’t relevant — noise, ambiguity, superficial false positives | “vestibular treatment” matches “water treatment” because “treatment” dominates the vector |
| Multi-hop impossible | Questions that require chaining multiple facts consistently fail | “What degree does the CEO of the company that makes the F-150 have?” — no single chunk contains the whole answer |
| No aggregation | Impossible to answer corpus-wide questions | “What are the main themes?”, “How many publications talk about GVS?” |
| Unresolved conflicts | Two contradictory chunks, no mechanism to decide | CEO of Twitter in 2022 vs 2023 — which is “more similar”? Both. |
| Embeddings ≠ meaning | Vectors capture semantic proximity, not business logic | Entity relations, temporality, domain hierarchy — none of that is in an embedding |
Common thread: we ask a similarity tool to do a structural understanding job. It’s like asking a spellchecker to validate an argument’s logic.
Incremental improvements: reranking and hybrid search
Two techniques have become reflexes to improve retrieval. They help — but they don’t change the paradigm.
Reranking adds a cross-encoder after top-k to better sort results. Precision improves, sometimes significantly. But the fundamental issue remains: if the right chunk isn’t in the initial candidate set, no reranker will make it appear. You optimize ordering, not coverage.
Hybrid search combines lexical search (BM25/keywords) and vector search. The gain is real: roughly a 20% recall improvement in common benchmarks. But the paradigm stays the same: flat search → sort → hope the right chunk is in the set. It’s a quick win, not an architectural shift.
Both techniques are useful. They should be the minimum. But treating them as the final solution is like putting a better engine in a car whose problem is the steering.
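To make that baseline concrete, here is a minimal, self-contained sketch of the hybrid-then-rerank pipeline. The scoring functions are toy stand-ins (word overlap for BM25, character-trigram overlap for embedding similarity and for the cross-encoder), not any specific library’s API; in practice you would swap in real components.

```python
# Toy hybrid search + reranking. Scoring functions are placeholders, not real models.

def lexical_score(query: str, doc: str) -> float:
    """Stand-in for BM25: fraction of query terms present in the document."""
    q_terms, d_terms = set(query.lower().split()), set(doc.lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1)

def semantic_score(query: str, doc: str) -> float:
    """Stand-in for embedding similarity: character-trigram overlap."""
    grams = lambda s: {s[i:i + 3] for i in range(len(s) - 2)}
    q, d = grams(query.lower()), grams(doc.lower())
    return len(q & d) / max(len(q | d), 1)

def hybrid_retrieve(query: str, corpus: list[str], k: int = 20, alpha: float = 0.5) -> list[str]:
    """Blend lexical and semantic scores, keep the top-k candidates."""
    key = lambda doc: alpha * lexical_score(query, doc) + (1 - alpha) * semantic_score(query, doc)
    return sorted(corpus, key=key, reverse=True)[:k]

def rerank(query: str, candidates: list[str], k: int = 5) -> list[str]:
    """Re-sort candidates (toy score standing in for a cross-encoder).
    If the right chunk is missing from `candidates`, no reranker can add it:
    ordering improves, coverage does not."""
    return sorted(candidates, key=lambda d: semantic_score(query, d), reverse=True)[:k]
```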
The analogy: library vs search engine
To understand the required shift, take a simple analogy.
A search engine is: you type words, you get “close” results. No structure, no navigation, no context. It’s statistical matching.
A library is: an index, categories, shelves, cards, classification codes, cross-references. The librarian doesn’t search by similarity — they navigate an organized structure. They know galvanic vestibular stimulation belongs in neuroscience, under neurophysiology, and that recent publications are sorted by date.
Today’s RAG is someone walking into a library, ignoring shelves and index, and flipping books at random looking for sentences that “look like” the question.
The paradigm shift is one sentence: move from “find the closest” to “navigate to the most relevant.”
Data is not a pile of vectors — it’s a corpus you can organize. Structure costs time to build. But it makes retrieval reliable, explainable, and auditable.
What do we actually need?
Before choosing a tool or a framework, you must state the needs. Six capabilities define robust retrieval.
| # | Need | What it enables | What flat search can’t do |
|---|---|---|---|
| 1 | Navigation across abstraction levels | Zoom from macro (themes) to meso (summaries) to micro (chunks) | Top-k knows only one granularity level |
| 2 | Metadata filtering | Constrain search by date, source, domain, type, authority | “Most similar” ≠ “most similar within this category, after this date” |
| 3 | Entity relationships | Traverse links: author → institution → publications → themes | Every multi-hop link would need an embedding match — fragile and often impossible |
| 4 | Corpus-wide questions | “What are the main themes?”, “How many publications on X?” | Flat search is structurally unable to aggregate |
| 5 | Adaptive strategy | Retrieval method adapts to question type | Factual ≠ synthesis ≠ exploration — one pipeline won’t cover all |
| 6 | Progressive retrieval | Return a summary first, then details if relevant | Sending 15 full chunks saturates context, costs money, and drowns signal |
Need #6 deserves emphasis. In production, per-query cost and answer quality depend directly on the number of tokens sent to the LLM. Progressive retrieval — fetch a card first, evaluate relevance, then fetch details — is not a nice-to-have. It’s what makes the system viable. The librarian doesn’t hand you 15 full books. They hand you a card. You decide whether you want the book.
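A minimal sketch of that card-then-details flow, assuming hypothetical inputs: `cards` (one lightweight summary per document), `chunks_by_doc` (full chunks keyed by document), and `is_relevant` (any cheap relevance check, such as a similarity threshold or a small LLM call).

```python
from dataclasses import dataclass

@dataclass
class Card:
    doc_id: str
    summary: str   # a few sentences: cheap to score and cheap to send to the LLM

def progressive_retrieve(query, cards, chunks_by_doc, is_relevant, max_docs=3):
    """Step 1: evaluate lightweight cards. Step 2: drill down only where it pays off."""
    selected = [card for card in cards if is_relevant(query, card.summary)]
    context = []
    for card in selected[:max_docs]:
        context.append(card.summary)                 # macro view, always included
        context.extend(chunks_by_doc[card.doc_id])   # micro view, only for relevant docs
    return context
```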
Running example: a scientific corpus
To make this concrete, use a realistic scenario: a corpus of 500 scientific publications about galvanic vestibular stimulation (GVS) covering neurophysiology, clinical applications, and virtual reality.
Here are questions researchers actually ask — and why flat search fails on each:
| User question | Primary need | Why flat search fails |
|---|---|---|
| “What are the major research themes in GVS?” | #4 Global aggregation | No single chunk contains “the main themes” |
| “Summarize Fitzpatrick lab’s work on postural GVS” | #3 Relationships + #1 Abstraction | You must link author → lab → publications → summary |
| “Which post-2022 publications cover GVS in VR?” | #2 Metadata + similarity | Without a time filter, top-k mixes 2005 and 2023 |
| “Compare sinusoidal vs noise stimulation protocols” | #5 Adaptive strategy | Comparison ≠ factual lookup — you need structured sources on both sides |
| “Give me an overview of this paper, then details from the methods section” | #6 Progressive retrieval | Sending the whole paper wastes tokens; retrieving only “methods” chunks misses context |
Each solution in the next section is illustrated by its ability to answer one of these questions.
Documented solutions
Four approaches, each targeting different needs. None is universal — the right choice depends on the real complexity of your queries.
Hierarchical indexing — level-based index
The principle is simple: organize the corpus into levels, like a table of contents. Document summaries → sections → detailed chunks. Retrieval navigates from general to specific.
In our running example, when a researcher asks for an overview and then method details, the system returns the document summary first (macro), evaluates relevance, then drills down to the “methods” chunk (micro). No token waste, no noise.
Needs covered: #1 Abstraction navigation, #6 Progressive retrieval.
The trade-off is blunt: the hierarchy must be built upfront, and it must match the real structure of the content. A bad hierarchy can perform worse than flat search.
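As a sketch, a level-based index can be as small as the data model below. The `score(query, text)` relevance function is assumed to be supplied by the caller (embedding similarity, for instance); the structure itself is illustrative, not a particular framework’s API.

```python
from dataclasses import dataclass, field

@dataclass
class Section:
    title: str
    summary: str
    chunks: list[str] = field(default_factory=list)

@dataclass
class Document:
    title: str
    summary: str
    sections: list[Section] = field(default_factory=list)

def hierarchical_retrieve(query, docs, score, top_docs=2, top_sections=1, top_chunks=3):
    """Navigate from the general (document summaries) to the specific (chunks)."""
    best_docs = sorted(docs, key=lambda d: score(query, d.summary), reverse=True)[:top_docs]
    results = []
    for doc in best_docs:
        best_sections = sorted(doc.sections, key=lambda s: score(query, s.summary),
                               reverse=True)[:top_sections]
        for sec in best_sections:
            results.extend(sorted(sec.chunks, key=lambda c: score(query, c),
                                  reverse=True)[:top_chunks])
    return results
```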
RAPTOR — recursive summary tree
RAPTOR pushes the idea further. Instead of an imposed hierarchy, it builds a bottom-up tree: chunks are clustered, each cluster is summarized by an LLM, summaries are clustered and summarized again, and so on. The result is a navigable tree where each node provides a different abstraction level.
On our GVS corpus, “What are the major research themes?” is answered by upper nodes — where summaries capture macro trends without requiring any individual chunk to contain that information.
Needs covered: #1 Abstraction navigation, #4 Corpus-wide questions, #6 Progressive retrieval.
The cost is significant: clustering + LLM summarization at each level. Quality depends directly on summary quality. A bad intermediate summary contaminates everything above it.
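A sketch of the bottom-up construction, with `cluster` and `summarize` as assumed callables (in the paper, clustering uses Gaussian mixture models over dimension-reduced embeddings and summaries come from an LLM; here both are simply passed in).

```python
def build_summary_tree(chunks, cluster, summarize, max_levels=3):
    """Bottom-up: cluster the current level, summarize each cluster, repeat."""
    levels = [list(chunks)]             # level 0: raw chunks
    nodes = list(chunks)
    for _ in range(max_levels):
        if len(nodes) <= 1:
            break                       # reached a single root summary
        groups = cluster(nodes)         # assumed: list[str] -> list[list[str]]
        nodes = [summarize("\n\n".join(group)) for group in groups]  # assumed LLM call
        levels.append(nodes)            # each level is one abstraction step higher
    return levels                       # corpus-wide questions query the upper levels
```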
GraphRAG — knowledge graph + communities
GraphRAG changes the representation. Instead of treating the corpus as a set of chunks, it extracts entities and relationships to build a knowledge graph, then applies hierarchical community detection (the Leiden algorithm) to identify communities of entities and generates a summary per community.
This is the approach that best answers multi-hop questions. “Summarize Fitzpatrick lab’s work on postural GVS” requires a graph traversal: Fitzpatrick → University of New South Wales → publications → filter by postural GVS. No vector search can reliably do that path — graph traversal does it natively.
Needs covered: #3 Entity relationships, #4 Corpus-wide questions, #1 Abstraction navigation.
The trade-off is heavy: high indexing cost (LLM-based entity extraction over the full corpus), maintenance complexity, and — often ignored — GraphRAG can underperform vanilla RAG on simple factual questions. The overhead is justified only if queries are truly complex.
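To see why the traversal becomes trivial once relationships are explicit, here is a toy graph and a hop-by-hop walk. The graph content is illustrative only; a real GraphRAG pipeline would build it with LLM-based extraction and add community summaries on top.

```python
# A toy knowledge graph as (entity, relation) -> targets. Content is illustrative.
graph = {
    ("Fitzpatrick", "affiliated_with"): ["University of New South Wales"],
    ("University of New South Wales", "published"): ["pub_12", "pub_48"],
    ("pub_12", "topic"): ["postural GVS"],
    ("pub_48", "topic"): ["vestibular imaging"],
}

def traverse(start: str, relations: list[str]) -> list[str]:
    """Follow a chain of relations hop by hop (author -> institution -> publications)."""
    frontier = [start]
    for rel in relations:
        frontier = [target for node in frontier for target in graph.get((node, rel), [])]
    return frontier

publications = traverse("Fitzpatrick", ["affiliated_with", "published"])
postural = [p for p in publications if "postural GVS" in graph.get((p, "topic"), [])]
print(postural)   # -> ['pub_12']
```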
Agentic RAG — the librarian agent
Agentic RAG doesn’t propose a new data structure — it adds a decision layer. An agent analyzes the question, chooses the optimal retrieval strategy, and orchestrates available tools: vector search, metadata filters, SQL, graph traversal, or combinations.
That’s the librarian. For “Which post-2022 publications cover GVS in VR?”, it doesn’t run raw vector search — it filters by time (date > 2022), then by topic (domain = VR), then runs semantic search within the filtered subset.
Needs covered: #5 Adaptive strategy — and potentially all others through orchestration.
Trade-off: implementation complexity, multi-step latency, and a critical dependency on routing quality. If the agent chooses poorly, results can be worse than a simple top-k.
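A sketch of the routing layer: the keyword rules below stand in for what would usually be an LLM-based classifier, and the strategy names are hypothetical labels for the tools described above.

```python
def route(question: str) -> str:
    """Pick a retrieval strategy before any search runs."""
    q = question.lower()
    if "main themes" in q or q.startswith("how many"):
        return "corpus_aggregation"            # e.g. RAPTOR upper levels / community summaries
    if "post-" in q or "since" in q or "after" in q:
        return "metadata_filter_then_vector"   # filter by date/domain, then semantic search
    if " of the " in q:
        return "graph_traversal"               # crude multi-hop hint
    return "hybrid_search"                     # default: lexical + vector

def answer(question: str, tools: dict) -> list[str]:
    """`tools` maps a strategy name to a callable(question) -> context chunks."""
    return tools[route(question)](question)
```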
Comparison of solutions
| Solution | Needs covered | Indexing cost | Complexity | Best use case |
|---|---|---|---|---|
| Hierarchical indexing | #1, #6 | Medium | Low | Well-structured corpus, factual questions needing progressive zoom |
| RAPTOR | #1, #4, #6 | High (LLM) | Medium | Unstructured corpus, multi-level summaries and corpus-wide questions |
| GraphRAG | #1, #3, #4 | Very high (LLM) | High | Multi-hop queries, entity relations, dense technical/narrative corpora |
| Agentic RAG | #5 (+ all via orchestration) | Variable | High | Heterogeneous queries requiring heterogeneous strategies |
In practice: custom is often the best choice
Frameworks cover generic patterns
RAPTOR, GraphRAG, and LlamaIndex offer ready-to-use architectures. They are well documented, tested, and a good starting point. But every domain has its own knowledge structure: the hierarchy of a medical corpus is nothing like that of a regulatory database or of customer-support content.
Decision synthesis
How do you choose between these approaches? Here is a decision matrix based on the nature of your corpus and your queries.
| Your situation | Recommended architecture | Why |
|---|---|---|
| Documentary corpus with clear structure (reports, regulations) | Hierarchical indexing + metadata filtering | Natural fit for an existing table of contents |
| Dense scientific corpus, frequent synthesis questions | RAPTOR | Can answer “What do we know about X?” without manual navigation |
| Highly relational data (entities, collaborations, causalities) | GraphRAG | Traversing Author–Protocol–Result relationships is essential |
| Unpredictable heterogeneous queries (sometimes SQL, sometimes semantic) | Agentic RAG | Maximum flexibility, at the cost of latency |
| Small volume (<1000 docs), simple queries | Hybrid search + reranking | Don’t over-engineer; complexity must match the need |
Build your own layer
The real work isn’t choosing a tool — it’s designing the index. Understand the domain structure, identify key relationships, choose meaningful abstraction levels. A tailored structuring layer often outperforms a plug-and-play framework applied as-is.
Custom’s advantage: full control over retrieval cost, granularity, and navigation logic. Drawback: you must know what you’re doing and be ready to invest design time.
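As a starting point, a custom layer can be as simple as an explicit record schema that names the metadata, relationships, and abstraction levels your retrieval logic will rely on. The fields below are illustrative for the GVS running example, not a general standard.

```python
from dataclasses import dataclass, field

@dataclass
class PublicationRecord:
    doc_id: str
    title: str
    year: int                                  # enables "post-2022" filters
    domain: str                                # e.g. "neurophysiology", "clinical", "VR"
    authors: list[str] = field(default_factory=list)
    institution: str = ""                      # enables author -> lab traversals
    summary: str = ""                          # macro level, returned first
    section_summaries: dict[str, str] = field(default_factory=dict)  # meso level
    chunk_ids: list[str] = field(default_factory=list)               # micro level
```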
Conclusion
What changes
Moving from “embed everything and top-k” to “structure, index, categorize, and let an agent navigate” is not an optimization — it’s a paradigm shift. Structured data is not overhead. It’s an investment. Progressive retrieval is not a nice-to-have. It’s what makes the system viable in production.
The question to ask
How would a human expert search in this corpus?
If the answer is “they flip things at random and take what looks similar” — you have a problem. If the answer is “they consult the index, identify the category, read the summary, then drill down” — build that.
Sources
| Reference | Type | URL |
|---|---|---|
| Seven Failure Points When Engineering a RAG System (2024) | Paper | arxiv.org |
| RAPTOR — Sarthi et al. (Stanford, 2024) | Paper | arxiv.org |
| GraphRAG — Edge et al. (Microsoft, 2024) | Paper | arxiv.org |
| IBM — RAG Problems Persist | Article | ibm.com |
| RAG Is a Data Engineering Problem | Article | substack.com |
| VectorHub — Hybrid Search & Reranking | Article | superlinked.com |
| 5 RAG Failures + Knowledge Graphs | Article | freecodecamp.org |
| PIXION — Hierarchical Index Retrieval | Article | pixion.co |
| NirDiamant/RAG_Techniques | Repo | github.com |
| Beyond Vector Search — Next-Gen RAG | Article | machinelearningmastery.com |
| LlamaIndex — Structured Hierarchical Retrieval | Doc | llamaindex.ai |
| Microsoft GraphRAG | Repo | github.com |
| RAPTOR | Repo | github.com |