The Limits of Pure Vector Search

Dense retrieval is remarkably good at semantic similarity. Ask it for "documents about contract termination clauses" and it will find them, even if the documents use different terminology. What dense retrieval can't do well is multi-hop reasoning. "What are the implications of clause 14.2 for the penalty provisions in section 8, given the force majeure definition in the appendix?" requires traversing relationships between entities across multiple documents. Embedding similarity won't get you there.

The failure mode is invisible. A RAG system retrieves the top-k most similar chunks and returns them. If the relevant information is distributed across multiple documents in a non-obvious way, the retrieval returns plausible but incomplete context. The model generates a confident answer from insufficient evidence. Users rarely notice until they find a consequential error.
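To make the failure mode concrete, here is a minimal sketch of top-k retrieval by cosine similarity. The chunk ids and the 3-dimensional vectors are invented for illustration (real embeddings have hundreds of dimensions), and the helper names are assumptions, not part of any particular library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunks, k):
    """Return the ids of the k chunks most similar to the query."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return [c["id"] for c in ranked[:k]]

# Hypothetical toy embeddings: each axis loosely stands for one topic.
CHUNKS = [
    {"id": "clause_14_2", "vec": [0.9, 0.1, 0.0]},
    {"id": "clause_14_3", "vec": [0.8, 0.2, 0.1]},
    {"id": "section_8_penalties", "vec": [0.1, 0.9, 0.1]},
    {"id": "appendix_force_majeure", "vec": [0.0, 0.2, 0.9]},
]

# The query embedding is dominated by the "clause 14.2" wording.
query_vec = [1.0, 0.1, 0.0]
retrieved = top_k(query_vec, CHUNKS, k=2)

# The multi-hop question needs all three of these chunks.
needed = {"clause_14_2", "section_8_penalties", "appendix_force_majeure"}
missing = needed - set(retrieved)
```

In this toy setup the two retrieved chunks both look relevant, yet two of the three chunks the question depends on never reach the model, and nothing in the retrieval output signals the gap.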

The Limits of Pure Graph Traversal

Knowledge graphs encode entities and relationships explicitly. They're excellent at multi-hop queries: follow the "has-clause" edge from contract entity, traverse the "references" edge to penalty section, resolve the "defined-by" edge to the force majeure definition. But graphs are rigid. They can only answer questions that map onto their schema. Novel queries that don't fit the entity-relationship structure fail silently: the traversal returns nothing, or returns a path that doesn't match the user's actual intent.

Building and maintaining a knowledge graph is also expensive. Every entity must be extracted, disambiguated, and linked. The schema must be designed upfront and is costly to change. For dynamic document collections where new document types appear regularly, pure graph approaches often can't keep pace.
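The traversal described above can be sketched with a plain dictionary standing in for a graph database. The node names and the `traverse` helper are hypothetical; only the three edge labels come from the example in the text.

```python
# Edges keyed by (source node, edge label). Node names are illustrative.
GRAPH = {
    ("contract", "has-clause"): ["clause_14_2"],
    ("clause_14_2", "references"): ["section_8_penalties"],
    ("section_8_penalties", "defined-by"): ["force_majeure_definition"],
}

def traverse(graph, start, edge_path):
    """Follow a sequence of edge labels from start; return the nodes reached."""
    frontier = [start]
    for label in edge_path:
        frontier = [dst for node in frontier for dst in graph.get((node, label), [])]
        if not frontier:
            # No edge of this type exists: the query quietly yields nothing.
            return []
    return frontier

# A query that fits the schema resolves in three hops.
answer = traverse(GRAPH, "contract", ["has-clause", "references", "defined-by"])

# A query that needs an edge type the schema never modeled fails silently.
no_answer = traverse(GRAPH, "contract", ["has-clause", "amended-by"])
```

The second call returns an empty list instead of raising an error, which is the silent failure the paragraph describes: the caller cannot tell "no such relationship exists" apart from "the schema cannot express this question".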

The Hybrid Architecture

The architecture we built combines both approaches with a query routing layer that decides which retrieval strategy to use, and often uses both.

At ingestion: documents go through dual processing. A dense embedding pipeline indexes chunks for semantic similarity. An extraction pipeline identifies entities, relationships, and structured facts and writes them to a graph database. Both representations are maintained in sync.

At query time: an intent classifier determines whether the query is primarily semantic (find similar content) or relational (traverse relationships between entities). Semantic queries go to dense retrieval. Relational queries go to graph traversal. Complex queries use both: dense retrieval identifies the relevant document subgraph, graph traversal extracts the specific relationships needed.
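The query-time routing can be sketched as follows. This is an illustration under loud assumptions: the keyword heuristic is a stand-in for a trained intent classifier, and `dense_search` and `graph_search` are hypothetical callables supplied by the caller, not a real API.

```python
import re

# Hypothetical cue lists. A production classifier would be trained on labeled queries.
RELATIONAL_CUES = re.compile(
    r"\b(implications? of|given|references?|defined (by|in)|relationship between)\b",
    re.IGNORECASE,
)
SEMANTIC_CUES = re.compile(r"\b(about|similar to|like this|related to)\b", re.IGNORECASE)

def classify_intent(query):
    """Label a query as 'semantic', 'relational', or 'complex' (both)."""
    relational = bool(RELATIONAL_CUES.search(query))
    semantic = bool(SEMANTIC_CUES.search(query))
    if relational and semantic:
        return "complex"
    return "relational" if relational else "semantic"

def route(query, dense_search, graph_search):
    """Dispatch a query to dense retrieval, graph traversal, or both."""
    intent = classify_intent(query)
    if intent == "semantic":
        return dense_search(query)
    if intent == "relational":
        return graph_search(query, None)
    # Complex: dense retrieval narrows the subgraph, then traversal runs inside it.
    seeds = dense_search(query)
    return graph_search(query, seeds)

# Stub retrievers so the sketch runs end to end.
def fake_dense(query):
    return ["chunk:clause_14_2"]

def fake_graph(query, seeds):
    scope = "seeded subgraph" if seeds else "full graph"
    return ["path found in " + scope]

semantic_result = route("find documents about contract termination", fake_dense, fake_graph)
relational_result = route("implications of clause 14.2 given the appendix", fake_dense, fake_graph)
complex_result = route("documents about how clause 14.2 references section 8", fake_dense, fake_graph)
```

Defaulting to the semantic path when no cue matches is a design choice in this sketch: a misrouted semantic query still returns similar content, while a misrouted relational query may return nothing at all.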

"The precision improvement from hybrid retrieval isn't incremental; it's categorical. Multi-hop questions that pure vector search simply couldn't answer became reliably answerable."

The 10× Precision Improvement

We measured precision at k=5 (did the retrieved context contain the information needed to answer the question?) across a test set of 400 queries. Pure dense retrieval: 61% precision. Graph-only traversal: 44% precision on the full test set (it excelled on structured queries but failed on the majority that required semantic search). Hybrid routing: 83% precision.

The 10× improvement headline refers to multi-hop queries specifically, the case where graph traversal was relevant. On those queries, pure dense retrieval achieved 8% precision. The hybrid system achieved 79%. This is the problem the hybrid architecture is designed to solve: it doesn't uniformly improve all retrieval, it dramatically improves the class of queries that pure semantic search can't handle.

83% precision is not 100%. The remaining errors are largely queries that require synthesis across many documents, a different problem that retrieval architecture alone can't solve.
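For readers who want to reproduce this kind of measurement, the sketch below shows one way to score the metric as defined above (a query counts as a success when the top-k context contains everything needed to answer it) and how the headline ratio follows from the reported figures. The function names and the toy queries are assumptions for illustration. Only the percentages passed to `improvement_ratio` come from the results above.

```python
def hit_rate_at_k(results, k=5):
    """Fraction of queries whose top-k context contains all needed chunk ids.

    results is a list of (retrieved_ids_in_rank_order, needed_ids) pairs.
    """
    results = list(results)
    hits = sum(1 for retrieved, needed in results if set(needed) <= set(retrieved[:k]))
    return hits / len(results)

def improvement_ratio(baseline, improved):
    """How many times larger the improved score is than the baseline."""
    return improved / baseline

# Hypothetical toy queries, not the 400-query test set.
toy_results = [
    (["a", "b", "c"], {"a"}),                  # hit
    (["a", "b", "c"], {"a", "z"}),             # miss: "z" was never retrieved
    (["d", "e", "f", "g", "h", "i"], {"i"}),   # miss: "i" sits at rank 6, outside k=5
]
toy_score = hit_rate_at_k(toy_results, k=5)

# Reported figures: 8% -> 79% on multi-hop queries, 61% -> 83% overall.
multi_hop_gain = improvement_ratio(0.08, 0.79)
overall_gain = improvement_ratio(0.61, 0.83)
```

The multi-hop ratio works out to just under ten, while the overall ratio is below 1.4, which is why the headline figure applies only to the multi-hop subset.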

Operational Considerations

The hybrid architecture is substantially more complex to operate than pure dense retrieval. Two storage systems (vector database and graph database) must stay in sync. The extraction pipeline needs robust entity disambiguation. The query routing classifier needs training data.

Our recommendation: don't build this unless you have a demonstrated need for multi-hop retrieval. If your queries are primarily semantic similarity ("find documents like this" or "find documents about X"), dense retrieval alone is far simpler and likely sufficient.

The signal that you need hybrid retrieval: users asking questions about relationships between entities, questions that require synthesizing information from multiple specific documents, or factual questions with clear answers that your RAG system consistently gets wrong. If you're seeing those failure patterns, the investment in a hybrid architecture is likely justified.