RAG in Production: Fix Chunking and Re-Ranking Before Touching Embeddings
Most RAG pipelines fail on chunking or re-ranking before they fail on embedding quality. A diagnostic-first framework for finding and fixing the right bottleneck.
Why this matters
Most RAG debugging I've seen follows the same trajectory: retrieval quality is poor, the team tries a different embedding model, quality improves marginally, and everyone moves on. Three months later, quality degrades again on a new class of queries and the cycle repeats.
I spent the better part of 2024 building and debugging RAG pipelines for regulated industries — financial document Q&A, compliance contract review, legal brief analysis. The lesson I learned, repeatedly, is that the embedding model is almost never the bottleneck. Chunking is. And if you fix chunking, re-ranking is the next highest-leverage intervention. Together they explain 80% of the retrieval failures I've diagnosed.
The order matters because the interventions have different cost curves. Changing an embedding model requires re-indexing your entire corpus, which takes hours to days depending on scale. Changing chunking requires re-preprocessing and re-embedding, so it costs about the same, but in my experience it is far more likely to pay off. Tuning a re-ranker runs at query time with no index rebuild. If you reach for the embedding model first, you pay the highest possible cost for the least certain gain.
Here's the debugging order I now follow, and why.
1. The diagnostic before any optimization
Before changing anything, measure. "The retrievals feel wrong" is not a diagnostic — it's a symptom. The RAGAS framework gives you three metrics worth understanding before you touch a config:
- Context Precision: of the chunks retrieved, what fraction are actually relevant? Low precision means noise in the context window.
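- Context Recall: of the information needed to answer the query, what fraction made it into the retrieved context? RAGAS scores this against a reference answer, so you need ground-truth answers for your test queries. Low recall means you're missing answers.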
- Faithfulness: is the generated answer grounded in the retrieved context? Low faithfulness means your LLM is hallucinating despite correct retrieval — a different problem entirely.
Run RAGAS on 50 representative queries before doing anything else. The diagnostic tells you what to fix:
- Low precision → re-ranking problem. You're retrieving relevant chunks but drowning them in noise.
- Low recall → chunking problem. The answer exists in your corpus but never lands in the context window.
- Both low → fix chunking first, then add re-ranking.
- Good precision and recall, low faithfulness → prompting or model problem, not a retrieval problem.
Most teams skip this step and optimize by intuition. Don't. Every optimization you make without a baseline RAGAS score is a guess.
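The triage rules above can be written down as a small function. This is a minimal sketch, not part of RAGAS: the `diagnose` name and the single 0.75 cutoff are assumptions for illustration, and you should pick thresholds from your own baseline.

```python
def diagnose(precision: float, recall: float, faithfulness: float,
             threshold: float = 0.75) -> str:
    """Map three RAGAS-style scores (each in 0..1) to the next thing to fix.

    The threshold is a hypothetical cutoff, not a RAGAS default.
    """
    low_precision = precision < threshold
    low_recall = recall < threshold

    if low_precision and low_recall:
        return "fix chunking first, then add re-ranking"
    if low_recall:
        return "chunking"
    if low_precision:
        return "re-ranking"
    if faithfulness < threshold:
        return "prompting or model"
    return "no retrieval bottleneck detected"


if __name__ == "__main__":
    # Illustrative scores only, not measurements from a real pipeline.
    print(diagnose(precision=0.60, recall=0.90, faithfulness=0.90))
```

Retrieval metrics are checked before faithfulness on purpose: a low faithfulness score is hard to interpret until you know the context the model received was good.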
2. Chunking is the highest-leverage intervention
Naive RAG tutorials chunk documents at fixed token counts — 512 tokens, 50-token overlap, done. This works for uniform prose and fails for nearly everything else.
Consider what a 512-token chunk looks like in a financial document. It might start mid-sentence in a clause, include a table header but no rows, and end before the definition that gives the clause its meaning. The embedding of that chunk will be semantically incoherent — and no embedding model is good enough to recover a meaningful retrieval from incoherent inputs.
# Naive: fixed-size chunks regardless of document structure
from langchain_text_splitters import RecursiveCharacterTextSplitter
# chunk_size counts characters by default; from_tiktoken_encoder (requires tiktoken) makes it count tokens
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=512, chunk_overlap=50)
chunks = splitter.split_text(document_text)  # ignores headings, clauses, and tables
# Better: structure-aware splitting respects semantic boundaries
from langchain_text_splitters import MarkdownHeaderTextSplitter
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("##", "section"), ("###", "subsection")]
)
chunks = splitter.split_text(markdown_text)  # returns Documents split at heading boundaries
The fix isn't a better splitter library — it's understanding your document structure and choosing a strategy that preserves semantic units. For financial contracts: clause-level splits. For API docs: one-function-per-chunk. For meeting transcripts: speaker-turn splits. For dense prose: paragraph-level with sentence overlap.
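For the dense-prose case, paragraph-level chunking with sentence overlap can be sketched in plain Python. This is an illustration under simplifying assumptions: paragraphs are separated by blank lines, and the regex sentence splitter is naive (it will mis-split abbreviations such as "e.g."). The function names are hypothetical.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def paragraph_chunks(text: str, overlap_sentences: int = 1) -> List[str]:
    """One chunk per paragraph, prefixed with the trailing sentence(s)
    of the previous paragraph so context carries across the boundary."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: List[str] = []
    previous_tail: List[str] = []
    for paragraph in paragraphs:
        sentences = split_sentences(paragraph)
        chunks.append(" ".join(previous_tail + sentences))
        # Guard against overlap 0: sentences[-0:] would return everything.
        previous_tail = sentences[-overlap_sentences:] if overlap_sentences > 0 else []
    return chunks


if __name__ == "__main__":
    sample = "A one. A two.\n\nB one. B two."
    for chunk in paragraph_chunks(sample):
        print(repr(chunk))
```

The overlap is counted in sentences rather than tokens so that a chunk never starts mid-sentence, which is the failure mode the fixed-size splitter produces.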
The second structural problem is metadata. A chunk without provenance — document ID, section title, creation date, source type — is just floating text. When retrieval returns a chunk from a superseded policy dated 2022, you have no way to filter it at query time without metadata. Add source, date, section, and document type to every chunk. Filter before similarity search when you can; filter after when you must — but always filter.
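One way to make provenance hard to forget is to attach it when the chunk is created. The sketch below assumes the document has already been split into (section title, body) pairs. The `Chunk` fields and the `chunks_with_provenance` helper are hypothetical names, not any library's schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass(frozen=True)
class Chunk:
    text: str
    document_id: str
    section: str
    source_type: str
    created: date


def chunks_with_provenance(document_id: str, source_type: str, created: date,
                           sections: List[Tuple[str, str]]) -> List[Chunk]:
    """Build one chunk per non-empty section, copying document-level
    metadata onto every chunk so it can be filtered at query time."""
    return [
        Chunk(text=body.strip(), document_id=document_id, section=title,
              source_type=source_type, created=created)
        for title, body in sections
        if body.strip()
    ]


if __name__ == "__main__":
    old = chunks_with_provenance("policy-2022", "policy", date(2022, 3, 1),
                                 [("Scope", "Applies to all staff.")])
    new = chunks_with_provenance("policy-2024", "policy", date(2024, 5, 1),
                                 [("Scope", "Applies to all staff and contractors.")])
    recent = [c for c in old + new if c.created >= date(2024, 1, 1)]
    print([c.document_id for c in recent])
```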
After fixing chunking in a legal Q&A system we'd built, context recall jumped from 58% to 79% on our RAGAS benchmark with no other changes. We hadn't touched the embedding model.
3. Re-ranking: the retrieval step most pipelines skip
Vector similarity search retrieves documents whose embeddings are close to your query embedding in high-dimensional space. That's a good approximation for semantic relevance. It's not a good ranker.
The problem: a bi-encoder embeds the query and each document separately, so the model never sees the two together, and relevance is reduced to the distance between two independently computed vectors. A cross-encoder re-ranker sees the query and a candidate document together, which lets it model their interaction. It's slower, but it's a materially better relevance signal.
The practical pipeline: retrieve the top 50 candidates with vector search (fast, cheap), then re-rank those 50 with a cross-encoder to get your top 5 (slower, accurate). The final 5 go into the LLM context window.
from sentence_transformers import CrossEncoder
# Stage 1: fast retrieval — top 50 candidates via vector search
candidates = vectorstore.similarity_search(query, k=50)
# Stage 2: cross-encoder re-ranking — top 5 from those 50
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc.page_content) for doc in candidates])
ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
top_5 = [doc for _, doc in ranked[:5]]
What this costs: cross-encoder inference on 50 candidate pairs at query time. With ms-marco-MiniLM-L-6-v2 on CPU, that's 80–150ms added latency. On GPU, under 20ms. Acceptable for most Q&A workloads; not acceptable for sub-100ms real-time pipelines.
For real-time use cases where re-ranking latency is unacceptable: use a stronger bi-encoder at retrieval time (e.g., bge-large-en-v1.5) and skip the cross-encoder. You trade some precision for latency. That tradeoff is explicit and measurable — unlike the implicit precision loss of weak retrieval with no re-ranking.
After adding re-ranking to the same legal Q&A system, context precision jumped from 61% to 84%. Combined with the chunking fix: precision 84%, recall 79% — numbers that held in production for the following quarter.
4. Hybrid search: when BM25 helps and when it doesn't
Pure vector search underperforms on exact-match queries — product SKUs, case numbers, person names, specific version strings. If your corpus is full of identifiers, you need a keyword component.
Hybrid search combines dense retrieval (vector similarity) with sparse retrieval (BM25 keyword scoring) and merges the result lists using Reciprocal Rank Fusion. In practice, most modern vector databases (Qdrant, Weaviate, Pinecone) implement this natively — prefer the native implementation over a custom one.
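If you want to see what the fusion step does before trusting a database's implementation, Reciprocal Rank Fusion fits in a few lines. This is a minimal sketch: each document scores the sum of 1 / (k + rank) over the lists it appears in. The value k = 60 is the constant commonly used with RRF; native implementations may use different constants or tie-breaking.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document IDs into one ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    dense_hits = ["a", "b", "c"]   # hypothetical vector-search order
    sparse_hits = ["c", "a", "d"]  # hypothetical BM25 order
    print(reciprocal_rank_fusion([dense_hits, sparse_hits]))  # ['a', 'c', 'b', 'd']
```

Because RRF uses only ranks, it needs no score normalization between the dense and sparse sides, which is a large part of why it is a common default.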
The signal for when to add BM25: look at your failed retrieval cases. If failures cluster around queries with specific identifiers — names, numbers, abbreviations — add hybrid search. If failures are semantic misunderstandings (user asks "termination clause", document says "cessation of obligations") — BM25 won't help. Improve chunking or embedding coverage instead.
The alpha parameter (the weight between dense and sparse scoring, where 0 is pure keyword and 1 is pure vector, as in Weaviate's convention) needs to be tuned on your actual dataset. Start at 0.5. In my experience: financial and legal corpora trend toward alpha 0.3–0.4 (more keyword weight), general-purpose knowledge bases toward 0.6–0.7 (more semantic weight). Measure the change with your RAGAS benchmark before shipping.
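To build intuition for what alpha does, here is a sketch of score-based fusion: each side's scores are min-max normalized, then blended as alpha * dense + (1 - alpha) * sparse. This is a different scheme from rank-based fusion, and it is an assumption about how a weighted hybrid could work, not a description of any particular database's internals. All names and scores below are hypothetical.

```python
from typing import Dict, List


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to 0..1 so dense and sparse scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_scores(dense: Dict[str, float], sparse: Dict[str, float],
                  alpha: float = 0.5) -> Dict[str, float]:
    """alpha = 1.0 is pure vector search, alpha = 0.0 is pure keyword search."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    doc_ids = set(dense_n) | set(sparse_n)
    # A document missing from one side contributes 0 from that side.
    return {d: alpha * dense_n.get(d, 0.0) + (1 - alpha) * sparse_n.get(d, 0.0)
            for d in doc_ids}


def rank(scores: Dict[str, float]) -> List[str]:
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    dense = {"a": 0.9, "b": 0.5, "c": 0.1}    # hypothetical cosine similarities
    sparse = {"c": 12.0, "b": 6.0, "d": 0.0}  # hypothetical BM25 scores
    for alpha in (0.3, 0.7):
        print(alpha, rank(hybrid_scores(dense, sparse, alpha)))
```

Running it with alpha 0.3 and 0.7 shows the top result flipping between the keyword favorite and the semantic favorite, which is the effect to measure on your own benchmark.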
One mistake I see often: teams add hybrid search before they've fixed chunking, because hybrid search sounds more sophisticated and requires fewer changes to the index. It doesn't fix a chunking problem. If your chunks are semantically incoherent, adding BM25 gives you incoherent chunks retrieved by two methods instead of one.
5. Metadata filtering before you scale
A failure mode that only appears at scale: your vector index grows to millions of documents, query latency climbs, and you respond by adding more compute. The underlying issue is that you're doing full-corpus similarity search when you only need a subset.
Metadata filtering — filtering to a subset of documents before or after similarity search — is cheaper than better hardware and solves a class of problems that embedding improvements can't address.
Pre-filtering: "retrieve only documents from Q3 2024", "retrieve only policy documents for customer tier Premium". These reduce effective index size before the expensive similarity operation. Post-filtering: "filter out any chunk whose document has superseded: true", "filter out chunks with a confidence score below threshold". These clean up noisy results after retrieval.
Both require that your metadata schema at index time is designed for the filters you'll need at query time. Design it before you need it. If you're indexing documents with a created_at date but your most frequent query type is "show me current regulations" and you have no effective_through field, you'll be filtering by date when you should be filtering by policy status — and you'll need a full reindex to fix it.
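The two filter stages can be sketched with a brute-force in-memory index. This is a toy under stated assumptions: the `IndexedChunk` schema (including `effective_through` and `superseded`), the `search` function, and the two-dimensional vectors are all hypothetical, and a real vector database would apply the pre-filter inside its index instead of scanning a list.

```python
import math
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Sequence


@dataclass
class IndexedChunk:
    text: str
    vector: Sequence[float]
    doc_type: str
    effective_through: Optional[date]  # None means still in force
    superseded: bool = False


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index: List[IndexedChunk], query_vector: Sequence[float],
           doc_type: str, today: date, k: int = 5, fetch_k: int = 50) -> List[IndexedChunk]:
    # Pre-filter: shrink the candidate set before the similarity computation.
    candidates = [
        c for c in index
        if c.doc_type == doc_type
        and (c.effective_through is None or c.effective_through >= today)
    ]
    ranked = sorted(candidates, key=lambda c: cosine(query_vector, c.vector),
                    reverse=True)[:fetch_k]
    # Post-filter: drop superseded chunks from the retrieved results.
    return [c for c in ranked if not c.superseded][:k]


if __name__ == "__main__":
    index = [
        IndexedChunk("current policy", [1.0, 0.0], "policy", None),
        IndexedChunk("expired policy", [1.0, 0.0], "policy", date(2023, 12, 31)),
        IndexedChunk("superseded policy", [0.9, 0.1], "policy", None, superseded=True),
        IndexedChunk("contract", [1.0, 0.0], "contract", None),
    ]
    hits = search(index, [1.0, 0.0], "policy", today=date(2024, 6, 1))
    print([h.text for h in hits])
```

Note that the "current regulations" query is answered by `effective_through`, not by a creation date, which is the schema decision that is expensive to retrofit.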
6. Embedding model selection: the last lever
After chunking, re-ranking, and optionally hybrid search, your retrieval quality should be substantially higher than baseline. At this point, the embedding model becomes the last lever to pull.
The cases where embedding model choice makes a material difference: domain-specific vocabulary that general models haven't seen in sufficient training volume — clinical notes, niche legal terminology, specialized financial instruments. And multilingual retrieval where language-specific fine-tuning matters.
For English-language general-purpose text, the quality gap between the top five models on the MTEB leaderboard is smaller than the gap between a well-chunked corpus and a poorly-chunked one. Switching from text-embedding-ada-002 to text-embedding-3-large gives you 3–7 points of MTEB gain that translates to maybe 2–4 points on your production queries — if chunking is already optimal.
When fine-tuning is worth it: if you have a labeled retrieval dataset (query → relevant document pairs) of at least 500–1000 examples, fine-tuning a bi-encoder on your domain can yield 5–15 points of recall improvement. That's meaningful but expensive — dataset collection, training infrastructure, re-indexing the full corpus on every model update.
Before investing in fine-tuning: run RAGAS post-chunking, post-re-ranking. If recall is still below 75%, something is structurally wrong and fine-tuning won't fix it. If recall is above 85%, fine-tuning is likely not worth the cost. The sweet spot is the band in between: a well-structured pipeline with a documented domain-specific retrieval gap.
The category frame
Retrieval-Augmented Generation is often treated as a single monolith. In practice it's a stack of independent choices: chunking strategy, embedding model, retrieval mechanism (vector, keyword, hybrid), re-ranker. Each has its own failure modes, cost structure, and tuning levers.
The debugging order I follow: measure first, fix chunking, add re-ranking, consider hybrid search if keyword queries are failing, tune the embedding model last if at all. That sequence saves the most expensive interventions for problems that actually require them.
If you're running a RAG pipeline in production and your RAGAS numbers don't match your users' experience — or building one and want to skip the debugging cycle — write me.