Apr 26, 2026

How We Cut LLM Inference Costs by 50%
The AI industry has a massive inefficiency problem. Too often, development teams treat the context window like a dumping ground.
The prevailing assumption is that if you throw enough tokens at a Large Language Model (LLM), it will eventually reason its way to the right answer. But when models meet messy reality, stuffing a context window with noisy retrieval is a recipe for hallucinations and burned cash. Massive token consumption is not a flex. It is usually a symptom of a weak system.
At Jus Mundi, we approached this differently. By focusing on the root cause of retrieval noise, we managed to reduce token consumption by nearly half while simultaneously improving the quality of our answers. Here is how we did it.
Why Standard Retrieval Fails in Specialized Domains
Standard embedding models used in Retrieval Augmented Generation (RAG) pipelines retrieve information based on surface text similarity. This works for general knowledge, but it breaks down in highly specialized domains. In arbitration, for example, the exact same paragraph means something entirely different depending on the tribunal, the year, or the applicable treaty.
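To make the failure concrete, here is a toy illustration. The hash-based embedder below is a stand-in for any text-only embedding model, and the sample chunks and metadata fields are invented for the example: two passages with identical wording but different tribunals, years, and treaties come out indistinguishable.

```python
import hashlib
import math

def toy_text_embedding(text: str, dim: int = 64) -> list[float]:
    """Toy bag-of-words hash embedding; a stand-in for any text-only embedder."""
    vec = [0.0] * dim
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# Two chunks with identical text but entirely different legal context.
chunk_a = {
    "text": "The tribunal finds that the claimant's investment qualifies for protection.",
    "metadata": {"tribunal": "ICSID", "year": 2004, "treaty": "NAFTA"},
}
chunk_b = {
    "text": "The tribunal finds that the claimant's investment qualifies for protection.",
    "metadata": {"tribunal": "SCC", "year": 2019, "treaty": "Energy Charter Treaty"},
}

# A text-only embedder cannot tell these apart: similarity is exactly 1.0,
# even though the legal meaning turns on tribunal, year, and treaty.
sim = cosine(toy_text_embedding(chunk_a["text"]), toy_text_embedding(chunk_b["text"]))
print(f"text-only similarity: {sim:.2f}")  # 1.00
```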
When a retrieval model is blind to that context, developers are forced to sacrifice precision to maintain recall. You end up over-retrieving, pulling 300 chunks instead of 100, just hoping the LLM can act as a filter.
This creates a severe bottleneck. Every token spent on irrelevant material is a token unavailable for actual reasoning.
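Some back-of-the-envelope math shows how fast this compounds. The chunk sizes and per-token price below are purely illustrative assumptions, not our actual figures:

```python
# Back-of-the-envelope context budget. All numbers here are assumptions
# for illustration (chunk size, chunk counts, price), not production figures.
TOKENS_PER_CHUNK = 400          # assumed average chunk size
PRICE_PER_M_INPUT_TOKENS = 3.0  # assumed USD price per 1M input tokens

def context_cost(num_chunks: int) -> tuple[int, float]:
    tokens = num_chunks * TOKENS_PER_CHUNK
    return tokens, tokens / 1_000_000 * PRICE_PER_M_INPUT_TOKENS

noisy_tokens, noisy_cost = context_cost(300)      # over-retrieve, let the LLM filter
precise_tokens, precise_cost = context_cost(100)  # precise retrieval

print(f"noisy:   {noisy_tokens:>7,} tokens  ${noisy_cost:.3f}/query")
print(f"precise: {precise_tokens:>7,} tokens  ${precise_cost:.3f}/query")
# Under these assumptions, two thirds of the noisy budget is spent on
# material the model must read only to discard.
```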
Fixing the Architecture: Jus AI Tenet v5
To fix this, we stopped trying to engineer prompts and changed the architecture. We recently shipped Jus AI Tenet v5, our proprietary embedding model built from the ground up for arbitration and international law.
Instead of relying purely on text strings, Tenet v5 encodes over 15 distinct metadata fields directly alongside the document context. We trained the model to map passages to a user's holistic legal intent rather than just matching textual patterns.
Because the LLM now receives highly precise, context-aware data, it no longer wastes processing capacity on eliminating noise.
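Tenet v5 itself is proprietary, but the general technique of metadata-conditioned embeddings can be sketched. One common approach is to serialize the structured fields in a fixed order ahead of the passage text, so the encoder sees legal context and content as a single sequence. The field names and the embed_fn parameter below are hypothetical placeholders, not our schema:

```python
# A minimal sketch of metadata-aware embedding input construction,
# assuming a generic metadata-conditioned encoder. Field names and
# embed_fn are hypothetical, not Tenet v5's actual interface.
from typing import Callable

METADATA_FIELDS = [  # a few illustrative fields; real systems may encode 15+
    "tribunal", "year", "treaty", "applicable_law", "procedural_stage",
]

def build_embedding_input(chunk: dict) -> str:
    """Serialize metadata in a fixed order, then append the passage text."""
    meta = chunk["metadata"]
    header = " | ".join(f"{f}: {meta[f]}" for f in METADATA_FIELDS if f in meta)
    return f"[{header}] {chunk['text']}"

def embed_chunk(chunk: dict, embed_fn: Callable[[str], list[float]]) -> list[float]:
    # embed_fn stands in for a model trained to map contextualized passages
    # to a user's holistic intent, rather than to raw textual patterns.
    return embed_fn(build_embedding_input(chunk))

print(build_embedding_input({
    "text": "The tribunal finds that the claimant's investment qualifies for protection.",
    "metadata": {"tribunal": "ICSID", "year": 2004, "treaty": "NAFTA"},
}))
# [tribunal: ICSID | year: 2004 | treaty: NAFTA] The tribunal finds that ...
```

With inputs built this way, the two identical passages from the earlier example now embed differently, so retrieval can rank by legal context rather than wording alone.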
The Downstream Impact
When you fix a system at its foundation, the downstream effects are massive. The metrics from our v5 deployment show exactly what happens when an LLM is given the right data:
46% reduction in overall token consumption.
30% reduction in the margin of error for deep research tasks.
10.2% average quality improvement across all metrics.
Because the LLM is not distracted by irrelevant context, its reasoning is more faithful, and its citations are exact.
Real Systems Over Hype
Owning your AI stack and aggressively reducing inference costs while driving up quality is the only way to build a sustainable AI business today. Hype does not pay the cloud computing bill. Real systems do.