Nov 10, 2025
RAG Hit a Wall - How We Rebuilt Legal AI to Actually Scale
Every AI team hits a wall. Ours had RAG written all over it.
When I joined Jus Mundi, our first Legal AI worked and delivered value. But as we expanded into more real legal workflows, it stopped scaling. The stack looked perfect on paper, closely following the well-known survey "Retrieval-Augmented Generation for Large Language Models: A Survey" - and that was exactly the problem. To understand why, it helps to remember what RAG promises in theory and where it fails in practice in high-precision domains like law. For an overview of the RAG paradigm and its evaluation methods, see the latest arXiv surveys. (arXiv)
Where textbook RAG broke for us
1) No clear OMTM - One Metric That Matters
Were we optimizing for precision, recall, or answer quality? Trying to maximize everything meant we optimized nothing in the way lawyers define "good."
2) Metric saturation
Raising top-k should lift recall. If recall flattens even as k increases, your pipeline has a structural ceiling that no LLM swap will fix.
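The saturation check above is easy to automate. A minimal sketch (document IDs and the tolerance threshold are illustrative, not our production values):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant set found in the top-k retrieved results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


def recall_curve_is_flat(retrieved: list[str], relevant: set[str],
                         ks: list[int], tolerance: float = 0.01) -> bool:
    """True when raising k buys almost no recall: a structural ceiling."""
    curve = [recall_at_k(retrieved, relevant, k) for k in sorted(ks)]
    return curve[-1] - curve[0] <= tolerance
```

Run this over a held-out query set: if the curve is flat across the k values you can afford, no amount of prompt or model tuning will surface the missing documents.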
3) Over-engineering retrieval
Legal retrieval needs domain adaptation. We drifted into brittle rules and if-elses - the symbolic-AI trap. Domain logic must be learned, not hardcoded.
4) The lawyer problem
Legal language is unforgiving. Mixing rule versions or confusing parties kills credibility. Citations are a minefield. Even great general-purpose models make these mistakes without better retrieval and orchestration.
This is why RAG demos can shine while production systems stall. Industry-wide excitement about RAG is justified - but adoption at scale still demands domain-specific choices. (TIME)
The 4 step rebuild that worked
1) Pick an OMTM - we chose recall
In law, missing a key precedent is not an option. We anchored on recall first, then layered precision and answer quality as second-order effects. If you cannot reliably retrieve everything that matters, nothing downstream can rescue the answer.
What we measured: recall-at-k curves, coverage versus token cost, citation faithfulness, and user-perceived completeness.
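Of these, citation faithfulness is the least standard, so here is one way to approximate it - a sketch that assumes answers carry bracketed markers like [icsid-42] (the marker format and IDs are hypothetical, not our actual schema):

```python
import re


def citation_faithfulness(answer: str, retrieved_ids: set[str]) -> float:
    """Fraction of citation markers in the answer that point at a document
    the retriever actually returned. Marker format [doc-id] is assumed."""
    cited = re.findall(r"\[([\w-]+)\]", answer)
    if not cited:
        return 1.0  # nothing cited, nothing to contradict
    grounded = sum(1 for doc_id in cited if doc_id in retrieved_ids)
    return grounded / len(cited)
```

A score below 1.0 means the model cited something it was never shown - exactly the failure mode lawyers will not forgive.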
2) Rebuild embeddings for arbitration and legal text
Off-the-shelf embeddings saturated too early. We trained for legal semantics so we could push k higher without drowning the LLM in noise. That unlocked richer context while keeping token costs in check.
Heuristic to track: k → tokens → latency → recall. If recall does not move as k increases, fix embeddings and retrieval before touching the model.
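The heuristic above can live in your eval harness as a small diagnostic. A sketch, with made-up field names and a made-up 0.02 gain threshold:

```python
from dataclasses import dataclass


@dataclass
class SweepPoint:
    """One measurement in a k-sweep: cost (tokens, latency) versus recall."""
    k: int
    prompt_tokens: int
    latency_ms: float
    recall: float


def diagnose(sweep: list[SweepPoint], min_gain: float = 0.02) -> str:
    """If recall barely moves between the smallest and largest k while
    tokens and latency climb, the fix is retrieval, not the model."""
    pts = sorted(sweep, key=lambda p: p.k)
    gain = pts[-1].recall - pts[0].recall
    if gain < min_gain:
        return "fix embeddings/retrieval"
    return "raising k still pays"
```

The point of the dataclass is to force you to log cost and quality together: a recall curve without its token and latency columns hides the real tradeoff.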
For background on RAG components and evaluation, see the arXiv survey synthesis. (arXiv)
3) De-symbolize retrieval
Instead of hand-written rules, we moved to learned selection guided by legal signals like document type, parties, procedural posture, dates, and clause versions. Keep heuristics small, auditable, and backed by data.
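Concretely, "learned selection" means turning those signals into features a ranker can weight from data, rather than branching on them in code. A minimal sketch - the field names and weights are hypothetical, not Jus Mundi's schema:

```python
def legal_signal_features(query: dict, doc: dict) -> list[float]:
    """Encode legal signals as ranker features (field names hypothetical)."""
    return [
        1.0 if doc.get("doc_type") == query.get("doc_type") else 0.0,
        1.0 if doc.get("rule_version") == query.get("rule_version") else 0.0,
        float(len(set(doc.get("parties", [])) & set(query.get("parties", [])))),
        1.0 if query.get("procedural_posture") is not None
               and doc.get("procedural_posture") == query.get("procedural_posture")
            else 0.0,
    ]


def rank(candidates: list[dict], query: dict, weights: list[float]) -> list[dict]:
    """Score candidates with learned weights instead of hand-written rules."""
    def score(doc: dict) -> float:
        return sum(w * f for w, f in zip(weights, legal_signal_features(query, doc)))
    return sorted(candidates, key=score, reverse=True)
```

The weights come from training data, so when lawyers disagree with a ranking you retrain instead of patching another if-else - which is what keeps the heuristic layer small and auditable.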
4) Build a proprietary multi-agent system that thinks like a lawyer
We designed orchestration that mirrors how arbitration lawyers actually work: a planner to decompose the question, a researcher to run targeted passes, an analyst to separate arguments and versions, and a citer to ground conclusions with precise, checkable references.
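The shape of that orchestration - not our implementation, just the data flow the four roles imply - can be sketched as a plain function chain, with each agent standing in for an LLM-backed component:

```python
from typing import Any, Callable

Agent = Callable[..., Any]


def run_workflow(question: str, planner: Agent, researcher: Agent,
                 analyst: Agent, citer: Agent) -> dict:
    """Planner decomposes the question, researcher runs one targeted pass
    per sub-question, analyst consolidates, citer grounds the conclusion
    in the retrieved passages."""
    sub_questions = planner(question)
    passages = [p for sq in sub_questions for p in researcher(sq)]
    analysis = analyst(passages)
    return citer(analysis, passages)
```

Writing it this way makes the point about owning orchestration: each stage is a plain function you can instrument, swap, and test in isolation, with no framework abstraction in between.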
We also removed abstraction layers that added complexity without value for our domain. For teams relying on generic agent frameworks, evaluate the tradeoffs carefully and read the platform’s own positioning about agents and orchestration to decide what you should own. See the official LangChain site and repository for context. (LangChain)
Results in under four months
Approximately 125 percent higher recall versus our previous stack
Significant gains in citation faithfulness and argument separation
More use cases unlocked without runaway token costs thanks to better retrieval and tighter orchestration
Checklist - signs your RAG is stuck in demo land
Recall at k flattens even as you raise k
Retrieval logic is a thicket of rules
Citations mix up parties or rule versions
Token costs rise without quality gains
User definition of good does not match your dashboard
Latency drifts due to excessive context packing and retries
If two or more resonate, your bottleneck is likely retrieval, not the LLM.
Practical playbook
1) Choose your OMTM explicitly
In regulated domains, start with recall. Add precision and answer quality as measurable layers.
2) Train or adapt embeddings
Domain-tuned embeddings beat generic ones on recall and coverage at a reasonable k.
3) Replace brittle rules with learned retrieval
Use domain signals and re-rankers. Keep heuristics simple and reversible.
4) Own orchestration where it matters
Mirror expert workflows. Separate planning, retrieval, analysis, and citation. Instrument everything.
5) Watch the cost stack
Track k → tokens → latency → recall. If costs climb without recall moving, fix retrieval first.
6) Measure what users care about
Completeness, correct citations, separation of arguments, and traceability beat generic scores.