Nov 10, 2025


AI Advice


RAG Hit a Wall - How We Rebuilt Legal AI to Actually Scale

Every AI team hits a wall. Ours had RAG written all over it.

When I joined Jus Mundi, our first Legal AI worked and delivered value. But as we expanded into more real legal workflows, it stopped scaling. The stack looked perfect on paper, closely following the well-known survey Retrieval-Augmented Generation for Large Language Models: A Survey - and that was exactly the problem. To understand why, it helps to remember what RAG promises in theory and where it fails in practice in high-precision domains like law. For an overview of the RAG paradigm and evaluation methods, see the latest arXiv surveys. (arXiv)

Where textbook RAG broke for us

1) No clear OMTM - One Metric That Matters
Were we optimizing for precision, recall, or answer quality? Trying to maximize everything meant we optimized nothing the way lawyers define good.

2) Metric saturation
Raising top-k should lift recall. If recall flattens even as k increases, your pipeline has a structural ceiling that no LLM swap will fix.

3) Over-engineering retrieval
Legal retrieval needs domain adaptation. We drifted into brittle rules and if-elses - the symbolic-AI trap. Domain logic must be learned, not hardcoded.

4) The lawyer problem
Legal language is unforgiving. Mixing rule versions or confusing parties kills credibility. Citations are a minefield. Even great general purpose models make these mistakes without better retrieval and orchestration.

This is why RAG demos can shine while production systems stall. Industry wide excitement about RAG is justified - but adoption at scale still demands domain specific choices. (TIME)

The 4 step rebuild that worked

1) Pick an OMTM - we chose recall

In law, missing a key precedent is not an option. We anchored on recall first, then layered precision and answer quality as second order effects. If you cannot reliably retrieve everything that matters, nothing downstream can rescue the answer.

What we measured: recall at k curves, coverage versus token cost, citation faithfulness, and user perceived completeness.
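The recall-at-k curves above are straightforward to compute. Here is a minimal, generic sketch of the metric (the toy document IDs and cutoffs are illustrative, not Jus Mundi's actual evaluation harness):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

def recall_curve(retrieved: list[str], relevant: set[str], ks=(5, 10, 20, 50)):
    """Recall@k at several cutoffs; a curve that flattens early is the
    structural-ceiling symptom described above."""
    return {k: recall_at_k(retrieved, relevant, k) for k in ks}

# Toy example: two of three relevant docs land in the top 5, the third at rank 6.
curve = recall_curve(["a", "x", "b", "y", "z", "c"], {"a", "b", "c"})
```

Plotting this curve per query set, rather than a single aggregate score, is what makes saturation visible.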

2) Rebuild embeddings for arbitration and legal text

Off-the-shelf embeddings saturated too early. We trained for legal semantics so we could push k higher without drowning the LLM in noise. That unlocked richer context while keeping token costs in check.

Heuristic to track: k → tokens → latency → recall. If recall does not move as k increases, fix embeddings and retrieval before touching the model.
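That heuristic can be instrumented as a simple sweep. A sketch under stated assumptions: `retrieve(k)` and `count_tokens(docs)` are placeholders for your own retrieval call and tokenizer, and the saturation threshold is arbitrary:

```python
import time

def sweep_k(retrieve, count_tokens, relevant, ks=(10, 20, 40, 80)):
    """Record tokens, latency, and recall at each k.

    `retrieve(k)` returns top-k document ids; `count_tokens(docs)` estimates
    context size. Both are stubs to swap for real implementations.
    """
    rows = []
    for k in ks:
        start = time.perf_counter()
        docs = retrieve(k)
        latency = time.perf_counter() - start
        recall = len(set(docs) & relevant) / max(len(relevant), 1)
        rows.append({"k": k, "tokens": count_tokens(docs),
                     "latency_s": round(latency, 4), "recall": recall})
    return rows

def recall_saturated(rows, eps=0.01):
    """True when recall stops improving while k (and cost) keeps growing --
    the signal to fix embeddings and retrieval before touching the model."""
    recalls = [r["recall"] for r in rows]
    return all(b - a < eps for a, b in zip(recalls, recalls[1:]))
```

If `recall_saturated` returns True while tokens and latency keep climbing, you are paying for context the retriever cannot use.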

For background on RAG components and evaluation, see the arXiv survey synthesis. (arXiv)

3) De-symbolize retrieval

Instead of hand written rules, we moved to learned selection guided by legal signals like document type, parties, procedural posture, dates, and clause versions. Keep heuristics small, auditable, and backed by data.
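To make "learned selection guided by legal signals" concrete, here is a toy feature-based re-ranker. The feature names, the logistic-regression-by-gradient-descent choice, and the training data shape are all illustrative assumptions, not the production model:

```python
import math

def featurize(doc: dict, query: dict) -> list[float]:
    # Features derived from legal signals (illustrative names, not a real schema).
    return [
        1.0 if doc["doc_type"] == query["doc_type"] else 0.0,
        1.0 if doc["parties"] & query["parties"] else 0.0,
        1.0 if doc["rule_version"] == query["rule_version"] else 0.0,
        doc["bm25_score"],  # lexical relevance from the base retriever
    ]

def train_scorer(examples, lr=0.2, epochs=500):
    """Logistic regression by gradient descent: weights are learned from
    labeled (doc, query, relevant) triples instead of hand-written if-elses."""
    w = [0.0] * 4
    for _ in range(epochs):
        for doc, query, label in examples:
            x = featurize(doc, query)
            p = 1 / (1 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            w = [wi + lr * (label - p) * xi for wi, xi in zip(w, x)]
    return w

def rerank(candidates, query, w):
    score = lambda d: sum(wi * xi for wi, xi in zip(w, featurize(d, query)))
    return sorted(candidates, key=score, reverse=True)
```

The point of the sketch: the same signals a rule engine would encode (document type, parties, rule version) become features whose weights the data decides, so the heuristics stay small and auditable.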

4) Build a proprietary multi agent system that thinks like a lawyer

We designed orchestration that mirrors how arbitration lawyers actually work: a planner to decompose the question, a researcher to run targeted passes, an analyst to separate arguments and versions, and a citer to ground conclusions with precise, checkable references.
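The planner/researcher/analyst/citer split can be sketched as a staged pipeline passing shared context. The stage bodies below are placeholder stubs (the real agents are proprietary and LLM-backed); only the orchestration shape is the point:

```python
from dataclasses import dataclass, field

@dataclass
class Context:
    question: str
    subquestions: list = field(default_factory=list)
    passages: list = field(default_factory=list)
    findings: list = field(default_factory=list)
    citations: list = field(default_factory=list)

def planner(ctx):      # decompose the question into targeted sub-questions
    ctx.subquestions = [ctx.question]  # stub: a real planner calls an LLM
    return ctx

def researcher(ctx):   # run a retrieval pass per sub-question
    ctx.passages = [f"passage for: {q}" for q in ctx.subquestions]
    return ctx

def analyst(ctx):      # separate arguments, parties, and rule versions
    ctx.findings = [f"finding from {p}" for p in ctx.passages]
    return ctx

def citer(ctx):        # ground each finding in a precise, checkable reference
    ctx.citations = [(f, f"source #{i}") for i, f in enumerate(ctx.findings)]
    return ctx

def answer(question: str) -> Context:
    ctx = Context(question=question)
    for stage in (planner, researcher, analyst, citer):
        ctx = stage(ctx)
    return ctx
```

Because each stage reads and writes one explicit context object, every intermediate artifact (sub-questions, passages, findings, citations) is inspectable, which is what makes the pipeline instrumentable and debuggable.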

We also removed abstraction layers that added complexity without value for our domain. For teams relying on generic agent frameworks, evaluate the tradeoffs carefully and read the platform’s own positioning about agents and orchestration to decide what you should own. See the official LangChain site and repository for context. (LangChain)

Results in under four months

  • Approximately 125 percent higher recall versus our previous stack

  • Significant gains in citation faithfulness and argument separation

  • More use cases unlocked without runaway token costs thanks to better retrieval and tighter orchestration

Checklist - signs your RAG is stuck in demo land

  • Recall at k flattens even as you raise k

  • Retrieval logic is a thicket of rules

  • Citations mix up parties or rule versions

  • Token costs rise without quality gains

  • Users' definition of good does not match your dashboard

  • Latency drifts due to excessive context packing and retries

If two or more resonate, your bottleneck is likely retrieval, not the LLM.

Practical playbook

1) Choose your OMTM explicitly
In regulated domains, start with recall. Add precision and answer quality as measurable layers.

2) Train or adapt embeddings
Domain tuned embeddings beat generic ones on recall and coverage at a reasonable k.

3) Replace brittle rules with learned retrieval
Use domain signals and re-rankers. Keep heuristics simple and reversible.

4) Own orchestration where it matters
Mirror expert workflows. Separate planning, retrieval, analysis, and citation. Instrument everything.

5) Watch the cost stack
Track k → tokens → latency → recall. If costs climb without recall moving, fix retrieval first.

6) Measure what users care about
Completeness, correct citations, separation of arguments, and traceability beat generic scores.


Ready to lead with confidence in an AI-driven world?

Copyright © 2024 AI with Ayushman. All Rights Reserved

Social
