RAG in Production: Fix Chunking and Re-Ranking Before Touching Embeddings
Most RAG pipelines fail on chunking or re-ranking before embedding quality. A diagnostic-first framework for finding and fixing the right bottleneck.
Why this matters
Most RAG debugging I see follows the same pattern: retrieval quality is poor, someone tries a different embedding model, quality improves a little, and the issue is treated as closed. Three months later, quality degrades again on a new class of queries and the cycle repeats.
I spent the better part of 2024 building and debugging RAG pipelines for regulated industries — financial document Q&A, compliance contract review, legal brief analysis. The lesson I learned, repeatedly, is that the embedding model is rarely the first bottleneck. In the systems I have worked on, chunking has often been the first structural issue, and chunking plus re-ranking explained many of the retrieval failures I diagnosed.
The order matters because the interventions have different cost curves. Changing an embedding model means re-indexing the corpus, which can take hours or days depending on scale. Changing chunking means re-preprocessing and re-embedding, so the cost is similar. Tuning a re-ranker runs at query time and does not require an index rebuild. When I reach for the embedding model first, I often pay a high cost before I know whether the expected gain is likely.
Here's the debugging order I now follow, and why.
1. The diagnostic before any optimization
Before changing anything, I establish a baseline. "The retrievals feel wrong" is not a diagnostic. It's a symptom. I use the RAGAS framework to separate three failure modes before I touch a config:
- Context Precision: of the chunks retrieved, what fraction are actually relevant? Low precision means noise in the context window.
- Context Recall: of the relevant information in the corpus, what fraction did I actually retrieve? Low recall means the answer is being missed.
- Faithfulness: is the generated answer grounded in the retrieved context? Low faithfulness means the LLM is hallucinating despite correct retrieval — a different problem entirely.
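To make the first two definitions concrete, here is a simplified set-based sketch. It is not how RAGAS computes its scores (RAGAS typically relies on LLM-based judgments); it assumes hand-labeled relevant chunk IDs for a query, and the function names are hypothetical.

```python
def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that are labeled relevant."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for cid in retrieved_ids if cid in relevant)
    return hits / len(retrieved_ids)


def context_recall(retrieved_ids, relevant_ids):
    """Fraction of labeled-relevant chunks that were actually retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved_ids)) / len(relevant)


# Hypothetical chunk IDs for a single query
retrieved = ["c1", "c7", "c9", "c4"]
relevant = ["c1", "c2", "c4"]

precision = context_precision(retrieved, relevant)  # 2 of 4 retrieved are relevant
recall = context_recall(retrieved, relevant)        # 2 of 3 relevant were retrieved
```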
I run RAGAS on a representative query set before doing anything else. The diagnostic tells me where to look:
- Low precision → likely re-ranking problem. The pipeline retrieves relevant chunks but drowns them in noise.
- Low recall → likely chunking problem. The answer exists in the corpus but never lands in the context window.
- Both low → I usually fix chunking first, then add re-ranking.
- Good precision and recall, low faithfulness → prompting or model problem, not a retrieval problem.
This step is easy to skip, and I see it skipped often. I try not to. Without a baseline RAGAS score, every optimization I make is a guess.
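The triage rules above can be written down as a small function. This is a sketch: the `triage` name and the 0.7 threshold are assumptions for illustration, and a real cutoff would come from the baseline scores of the system being debugged.

```python
def triage(precision, recall, faithfulness, threshold=0.7):
    """Map baseline scores to the first place to look.

    threshold is a hypothetical cutoff, not a RAGAS recommendation.
    """
    low_precision = precision < threshold
    low_recall = recall < threshold
    if low_precision and low_recall:
        return "fix chunking first, then add re-ranking"
    if low_recall:
        return "chunking"
    if low_precision:
        return "re-ranking"
    if faithfulness < threshold:
        return "prompting or model"
    return "no obvious retrieval bottleneck"


first_step = triage(precision=0.85, recall=0.40, faithfulness=0.90)
```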
2. Chunking is usually the first structural intervention
Naive RAG tutorials chunk documents at fixed token counts — 512 tokens, 50-token overlap, done. This can work for uniform prose. In my experience, it fails quickly when the document structure carries meaning.
A 512-token chunk in a financial document can be messy. It might start mid-sentence in a clause, include a table header but no rows, and end before the definition that gives the clause its meaning. The embedding for that chunk is semantically incoherent. I do not expect an embedding model to recover reliable retrieval from incoherent inputs.
# Naive: fixed-size chunks regardless of document structure
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Note: chunk_size counts characters by default, not tokens
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = splitter.split_text(document_text)  # can split mid-sentence or mid-table

# Better: structure-aware splitting respects semantic boundaries
from langchain.text_splitter import MarkdownHeaderTextSplitter

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("##", "section"), ("###", "subsection")]
)
chunks = splitter.split_text(markdown_text)  # Documents split at heading boundaries
The fix isn't a better splitter library — it's understanding the document structure and choosing a strategy that preserves semantic units. For financial contracts, I usually split at the clause level. For API docs, I prefer one function per chunk. For meeting transcripts, speaker turns have worked better in the systems I have handled. For dense prose, I use paragraph-level chunks with sentence overlap.
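As an illustration of clause-level splitting, here is a minimal sketch using only the standard library. It assumes clauses start on their own line with a number such as `2.` or `2.1`; real contracts vary, so the pattern and the `split_clauses` helper are hypothetical.

```python
import re

# Zero-width match at the start of any line that begins with a clause number.
# Caveat: a wrapped body line that happens to start with a number would also match.
CLAUSE_START = re.compile(r"^(?=\d+(?:\.\d+)*\.?\s)", re.MULTILINE)


def split_clauses(text):
    """Split contract text into chunks, one per numbered clause."""
    parts = CLAUSE_START.split(text)
    return [part.strip() for part in parts if part.strip()]


contract = (
    "1. Definitions\n"
    "\"Term\" means the period set out in Schedule A.\n"
    "2. Termination\n"
    "2.1 Either party may terminate on 30 days' notice.\n"
    "2.2 Termination does not affect accrued rights.\n"
)

clauses = split_clauses(contract)
```

Each chunk keeps its clause number, so the definition in clause 1 stays attached to its heading instead of being cut at an arbitrary offset.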
The second structural problem is metadata. A chunk without provenance — document ID, section title, creation date, source type — is just floating text. If retrieval returns a chunk from a superseded policy dated 2022, I need metadata to filter it at query time. In my pipelines, I add source, date, section, and document type to every chunk. I filter before similarity search when I can, and after similarity search when I must.
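A sketch of what attaching provenance at chunking time can look like. The `Chunk` fields mirror the ones named above (source, date, section, document type); the class, the `with_provenance` helper, and the sample documents are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str      # document ID or path
    section: str     # heading the chunk came from
    doc_type: str    # e.g. "policy", "contract"
    created: date


def with_provenance(sections, source, doc_type, created):
    """Turn (section_title, text) pairs into chunks that carry their provenance."""
    return [
        Chunk(text=text, source=source, section=title, doc_type=doc_type, created=created)
        for title, text in sections
    ]


old = with_provenance([("Refunds", "Refunds within 14 days.")],
                      "policy-2022", "policy", date(2022, 3, 1))
new = with_provenance([("Refunds", "Refunds within 30 days.")],
                      "policy-2024", "policy", date(2024, 6, 1))

# Query-time filter: keep only policy chunks created on or after a cutoff date.
cutoff = date(2024, 1, 1)
eligible = [c for c in old + new if c.doc_type == "policy" and c.created >= cutoff]
```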
After fixing chunking in a legal Q&A system I had built, context recall increased on the RAGAS benchmark I was using. I had not touched the embedding model. I treat that as a benchmark result for that system, not as a general guarantee.
3. Re-ranking: the retrieval step many pipelines skip
Vector similarity search retrieves documents whose embeddings are close to the query embedding in high-dimensional space. That's a useful approximation for semantic relevance. It is not a ranker I trust on its own.
The issue is simple: a bi-encoder embeds the query and each document separately, so the model never sees them together. The document embedding is computed at index time without knowing which query will arrive. A cross-encoder re-ranker sees the query and a candidate document together, so it can model their interaction. It is slower, but in my benchmarks it has often produced a better relevance signal for the final context set.
The pipeline I usually use is straightforward: retrieve the top 50 candidates with vector search, then re-rank those 50 with a cross-encoder to get the top 5. The final 5 go into the LLM context window.
from sentence_transformers import CrossEncoder
# Stage 1: fast retrieval — top 50 candidates via vector search
candidates = vectorstore.similarity_search(query, k=50)
# Stage 2: cross-encoder re-ranking — top 5 from those 50
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc.page_content) for doc in candidates])
ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
top_5 = [doc for _, doc in ranked[:5]]
What this costs: cross-encoder inference on 50 candidate pairs for every query. With ms-marco-MiniLM-L-6-v2 on CPU, the added latency has been noticeable on my own workloads and hardware. I treat that as a workload-specific measurement, not as a general latency claim. For the Q&A workloads I have worked on, that can be acceptable. For pipelines with strict sub-100ms latency targets, it usually is not.
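One way to turn "workload-specific" into a number is to time the re-ranking call on representative queries and look at percentiles. The sketch below uses only the standard library; `measure_latency_ms` and `fake_rerank` are hypothetical names, and the sleep is a stand-in for the real `reranker.predict(...)` call.

```python
import statistics
import time


def measure_latency_ms(fn, inputs, warmup=1):
    """Call fn on each input and return (p50, p95) wall-clock latency in milliseconds."""
    for item in inputs[:warmup]:
        fn(item)  # warm-up calls are not timed
    samples = []
    for item in inputs:
        start = time.perf_counter()
        fn(item)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p50 = statistics.median(samples)
    # Nearest-rank style index, clamped to the last sample
    p95 = samples[min(len(samples) - 1, int(0.95 * len(samples)))]
    return p50, p95


def fake_rerank(query):
    # Stand-in for scoring 50 (query, document) pairs; replace with the real call.
    time.sleep(0.002)
    return query


p50, p95 = measure_latency_ms(fake_rerank, ["example query"] * 20)
within_budget = p95 < 100.0  # compare against the latency target for the pipeline
```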
For real-time use cases where re-ranking latency is unacceptable, I usually use a stronger bi-encoder at retrieval time, such as bge-large-en-v1.5, and skip the cross-encoder. That trades some precision for latency. I prefer making that tradeoff explicit and measurable instead of accepting the hidden precision loss of weak retrieval with no re-ranking.
After adding re-ranking to the same legal Q&A system, context precision increased on my RAGAS benchmark. Combined with the chunking fix, the measured retrieval failures aligned more closely with the failures users had reported.
4. Hybrid search: when BM25 helps and when it doesn't
Pure vector search can underperform on exact-match queries — product SKUs, case numbers, person names, specific version strings. When I see a corpus full of identifiers, I test a keyword component.
Hybrid search combines dense retrieval with sparse retrieval. Dense retrieval uses vector similarity. Sparse retrieval uses BM25 keyword scoring. The result lists are then merged, often with Reciprocal Rank Fusion. In practice, many vector databases (Qdrant, Weaviate, Pinecone) support this natively. I prefer the native implementation when it is available instead of maintaining a custom one.
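For reference, Reciprocal Rank Fusion is short enough to sketch directly. Each document scores the sum of 1 / (k + rank) over the lists it appears in; k = 60 is the commonly used constant. The function name and the sample rankings are illustrative, and native database implementations may differ in details.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists of document IDs; ranks are 1-based."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_ranking = ["a", "b", "c"]    # from vector search
sparse_ranking = ["c", "a", "d"]   # from BM25

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

Documents that appear in both lists ("a" and "c") rise above documents that appear in only one, without needing the raw scores to be on comparable scales.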
Failed retrieval cases tell me when to add BM25. If the misses cluster around specific identifiers — names, numbers, abbreviations — I test hybrid search. If the misses are semantic, the next step is different. For example, a user may ask for "termination clause" while the document says "cessation of obligations". In the systems I have evaluated, BM25 usually has not helped much with that class of failure, so I look again at chunking or embedding coverage.
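A rough way to check whether misses cluster around identifiers is to tally failed queries with a heuristic. The pattern below is an assumption for illustration: it flags tokens that contain a digit and short all-caps abbreviations, and it will not catch person names.

```python
import re
from collections import Counter

# Hypothetical heuristic: a token containing a digit, or a 2-6 letter all-caps abbreviation
IDENTIFIER = re.compile(r"\b(?:[A-Za-z]*\d[\w.\-/]*|[A-Z]{2,6})\b")


def classify_miss(query):
    """Label a failed query as identifier-like or semantic."""
    return "identifier" if IDENTIFIER.search(query) else "semantic"


failed_queries = [
    "status of case 2023-CV-0417",
    "SKU AB-1093 return policy",
    "what is the termination clause",
    "changes in v2.3.1",
]

tally = Counter(classify_miss(q) for q in failed_queries)
```

If most of the tally lands under "identifier", that is a signal to test hybrid search; a mostly "semantic" tally points back to chunking or embedding coverage.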
The alpha parameter weights dense versus sparse scoring and needs to be tuned on the actual dataset. In the convention used here, a higher alpha puts more weight on the dense (vector) score and a lower alpha puts more weight on BM25. I start at 0.5. In the evaluations I have run, I have sometimes seen financial and legal corpora move toward alpha 0.3–0.4, with more keyword weight. I have also seen general-purpose knowledge bases move toward 0.6–0.7, with more semantic weight. Before shipping, I measure the change with the RAGAS benchmark.
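To show what alpha does, here is a sketch of weighted score fusion, assuming alpha is the weight on the dense score and that both score sets are min-max normalized first. Vector databases implement their own fusion, so this is an illustration of the idea, not any product's algorithm; all names and scores are made up.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_scores(dense, sparse, alpha=0.5):
    """alpha weights the dense score; (1 - alpha) weights the sparse score."""
    d, s = min_max(dense), min_max(sparse)
    docs = set(d) | set(s)
    return {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0) for doc in docs}


dense = {"a": 0.82, "b": 0.80, "c": 0.60}             # cosine similarities
sparse = {"c": 12.0, "d": 9.0, "a": 6.0, "e": 3.0}    # BM25 scores

# Sweep alpha and record which document ranks first at each setting
top_by_alpha = {}
for alpha in (0.3, 0.5, 0.7):
    fused = hybrid_scores(dense, sparse, alpha)
    top_by_alpha[alpha] = max(fused, key=fused.get)
```

In this toy data the keyword-favored document "c" wins at alpha 0.3 and the semantically closest document "a" wins at 0.5 and above, which is why the sweep belongs on the real evaluation set.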
One mistake I see often: hybrid search gets added before chunking is fixed, because it sounds more sophisticated and requires fewer changes to the index. It does not fix a chunking problem. If the chunks are semantically incoherent, adding BM25 gives me incoherent chunks retrieved by two methods instead of one.
5. Metadata filtering before you scale
A failure mode that only appears at scale: the vector index grows to millions of documents, query latency climbs, and the first response is to add more compute. In my workloads, I usually look for a filtering problem first.
Metadata filtering — filtering to a subset of documents before or after similarity search — can be cheaper than better hardware. It also solves a class of problems that embedding improvements cannot address.
Pre-filtering examples: "retrieve only documents from Q3 2024", "retrieve only policy documents for customer tier Premium". These reduce effective index size before the expensive similarity operation. Post-filtering examples: "filter out any chunk whose document has superseded: true", "filter out chunks with a confidence score below threshold". These clean up noisy results after retrieval.
Both require a metadata schema designed at index time for the filters I will need at query time. I design it before I need it. If I index documents with a created_at date but the most frequent query type is "show me current regulations", I also need a policy-status concept such as effective_through or an equivalent field. Without it, I end up filtering by date when I should be filtering by status, and the fix usually requires a full reindex.
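A sketch of how a status-style field changes the filter. It assumes each indexed item carries an optional `effective_through` date and a `superseded` flag; the `search` function, the precomputed `score` values, and the sample index are hypothetical stand-ins for a real vector store query.

```python
from datetime import date


def is_current(meta, today):
    """Current means not superseded and not past effective_through (None = open-ended)."""
    end = meta.get("effective_through")
    return not meta.get("superseded", False) and (end is None or end >= today)


def search(index, score_fn, today, k=3, min_score=0.0):
    # Pre-filter: shrink the candidate set before the similarity step
    candidates = [item for item in index if is_current(item["meta"], today)]
    scored = [(score_fn(item), item) for item in candidates]
    # Post-filter: drop low-scoring results after retrieval
    kept = [(s, item) for s, item in scored if s >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [item["id"] for _, item in kept[:k]]


index = [
    {"id": "reg-2022", "score": 0.90,
     "meta": {"created_at": date(2022, 1, 1), "effective_through": date(2023, 12, 31)}},
    {"id": "reg-2024", "score": 0.80,
     "meta": {"created_at": date(2024, 1, 1), "effective_through": None}},
    {"id": "memo", "score": 0.20,
     "meta": {"created_at": date(2024, 5, 1)}},
    {"id": "draft", "score": 0.85, "meta": {"superseded": True}},
]

results = search(index, lambda item: item["score"], today=date(2024, 9, 1), min_score=0.5)
```

The expired regulation scores highest on similarity but never reaches the ranking step, which a `created_at` filter alone could not guarantee.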
6. Embedding model selection: the last lever
After chunking, re-ranking, and optionally hybrid search, retrieval quality has usually been higher than the baseline in the systems I evaluate. At that point, the embedding model becomes the last lever I pull.
In my work, embedding model choice has mattered most in two cases. The first is domain-specific vocabulary that general models may not have seen in sufficient training volume: clinical notes, niche legal terminology, specialized financial instruments. The second is multilingual retrieval, where language-specific fine-tuning can matter.
For English-language general-purpose text, the quality gap between leading models on public embedding benchmarks is often smaller than the gap I see between a well-chunked corpus and a poorly chunked one. Switching embedding models can help, but I do not expect it to compensate for broken chunking.
When fine-tuning is worth it: I only consider fine-tuning when I have a labeled retrieval dataset, such as query → relevant document pairs, with enough examples to evaluate changes reliably. Fine-tuning a bi-encoder on the domain can improve recall, but it adds real cost: dataset collection, training infrastructure, and re-indexing the full corpus on every model update.
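With a labeled set of query → relevant document pairs, comparing a baseline against a candidate model can be as simple as mean recall@k. The sketch below assumes each retriever is a callable returning ranked document IDs; the queries, IDs, and rankings are made up for illustration.

```python
def recall_at_k(retrieve, labeled, k=5):
    """Mean fraction of labeled-relevant documents found in the top k, over all queries."""
    total = 0.0
    for query, relevant in labeled.items():
        top_k = set(retrieve(query)[:k])
        total += len(top_k & set(relevant)) / len(relevant)
    return total / len(labeled)


# Hypothetical labeled pairs: query -> IDs of documents that answer it
labeled = {
    "notice period for termination": ["doc3"],
    "definition of net asset value": ["doc1", "doc8"],
}

# Stand-ins for two retrievers: precomputed rankings looked up by query
baseline_runs = {
    "notice period for termination": ["doc5", "doc2", "doc3"],
    "definition of net asset value": ["doc1", "doc4", "doc6"],
}
candidate_runs = {
    "notice period for termination": ["doc3", "doc5", "doc2"],
    "definition of net asset value": ["doc1", "doc8", "doc4"],
}

baseline = recall_at_k(baseline_runs.get, labeled, k=2)
candidate = recall_at_k(candidate_runs.get, labeled, k=2)
```

Two queries are far too few to conclude anything; the point is that the comparison only becomes meaningful once the labeled set is large enough to be stable.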
Before investing in fine-tuning, I run RAGAS after chunking and re-ranking. If recall is still low, I look for a structural problem first. I do not expect fine-tuning to fix broken document boundaries or missing metadata. If recall is already high and the remaining failures are narrow, fine-tuning may not be worth the cost. The case where I consider it seriously is a well-structured pipeline with a documented domain-specific retrieval gap.
The category frame
Retrieval-Augmented Generation is often treated as a single monolith. In practice, I treat it as a stack of independent choices: chunking strategy, embedding model, retrieval mechanism (vector, keyword, hybrid), and re-ranker. Each has its own failure modes, cost structure, and tuning levers.
The debugging order I follow is stable in my work: measure first, fix chunking, add re-ranking, consider hybrid search if keyword queries are failing, and tune the embedding model last if at all. That sequence keeps high-cost interventions for problems that the earlier diagnostics did not resolve.
If you want help debugging a production RAG pipeline whose RAGAS numbers do not match user experience, or designing one with reproducible evals from the start, write to me.
FAQ
How should I diagnose poor RAG retrieval before changing configs?
I measure first with RAGAS on a representative query set. I separate Context Precision, Context Recall, and Faithfulness before touching a config. Low precision points to re-ranking, low recall points to chunking, both low means I usually fix chunking first and then re-rank, and low faithfulness with good retrieval points to prompting or model behavior.
Why should I fix chunking before switching embedding models?
I do not expect an embedding model to recover reliable retrieval from incoherent inputs. Fixed-size chunks can split clauses, tables, definitions, or other semantic units. I choose a strategy that preserves document structure, such as clause-level chunks for contracts, function-level chunks for API docs, or paragraph chunks with sentence overlap for dense prose.
When should I add a cross-encoder re-ranker to a RAG pipeline?
I add re-ranking when precision is low and retrieved context contains too much noise. My usual pipeline retrieves the top 50 candidates with vector search, then uses a cross-encoder to select the top 5 for the LLM context window. I treat the added query-time latency as workload-specific and measure it before shipping.
When does hybrid search help more than pure vector search?
I test hybrid search when failed retrieval cases cluster around exact identifiers such as product SKUs, case numbers, person names, or version strings. If the failures are semantic misunderstandings, BM25 usually has not helped in the systems I have evaluated, and I look again at chunking or embedding coverage.
When is embedding model tuning worth considering?
I treat the embedding model as the last lever after chunking, re-ranking, and optional hybrid search. I consider fine-tuning only when I have a labeled retrieval dataset, enough examples to evaluate reliably, and a well-structured pipeline with a documented domain-specific retrieval gap.