
Semantic Caching for SAP HANA Cloud

How langchain-hana-cache cuts LLM latency by 96% and eliminates redundant API calls — using the vector engine you already have.

The problem

If you're building AI agents, you've probably noticed something: users ask the same questions over and over, just worded differently. "How do I reset my password?", "I need help with my password", "password reset process" — they all want the same answer.

But every single one of those triggers a full LLM API call. That's tokens spent, latency added, and money burned for something you've already answered.

This is one of those problems that seems small until you look at the numbers in production. A customer service agent handling hundreds of conversations a day, a document Q&A system where people keep asking about the same regulations, an internal help desk where the top 20 questions cover 80% of the traffic. Every repeated question is another expensive, unnecessary API call.

Semantic caching changes this

Instead of matching prompts word by word, semantic caching understands that "how do I reset my password" and "I forgot my password, help" mean the same thing. When a similar enough question comes in, it returns the answer you already have, instantly. No API call, no tokens, no waiting.

Store the response once, serve it forever — or until you decide it should expire.

langchain-hana-cache

I built and released langchain-hana-cache, an open-source package that brings semantic caching to SAP HANA Cloud. It uses the same vector similarity engine that already powers RAG retrieval in HANA, but instead of matching documents, it matches prompts.

Installation

pip install langchain-hana-cache

It integrates with LangChain in two lines of code. The cache table lives in the same HANA instance where your business data already sits — no additional infrastructure to manage, no Redis to set up, no separate vector database. If you're already on BTP with HANA Cloud, you're ready to go.

The results

  • 21x faster on cache hits
  • 96% latency reduction
  • 0 tokens consumed on a cache hit

That's going from 8 seconds down to under half a second. And since the LLM never gets called on a cache hit, you consume zero tokens. For high-traffic agents on expensive models like Claude Opus or GPT-5, the savings add up fast.

Key points
  • Uses HANA's vector similarity engine — the same one powering your RAG retrieval
  • Two lines of LangChain integration — drop-in replacement for existing cache
  • No extra infrastructure — lives in your existing HANA instance
  • Configurable similarity threshold and TTL expiration
  • MIT licensed, PRs welcome
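The last two configurable knobs, the similarity threshold and TTL, act as a gate on every candidate hit: too dissimilar and you fall through to a real LLM call, too old and the entry is treated as stale. A minimal sketch of that decision (the `CachedAnswer` and `accept` names are illustrative, not the package's API):

```python
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class CachedAnswer:
    answer: str       # the stored LLM response
    score: float      # similarity of the cached prompt to the incoming one, in [0, 1]
    stored_at: float  # unix timestamp when the entry was written

def accept(hit: CachedAnswer, threshold: float,
           ttl_seconds: Optional[float],
           now: Optional[float] = None) -> Optional[str]:
    """Return the cached answer only if it is similar enough and not expired."""
    now = time.time() if now is None else now
    if hit.score < threshold:
        return None  # too dissimilar: fall through to a real LLM call
    if ttl_seconds is not None and now - hit.stored_at >= ttl_seconds:
        return None  # stale: force a refresh so answers can change over time
    return hit.answer
```

A higher threshold means fewer (but safer) hits; a shorter TTL trades hit rate for freshness, which matters when the underlying answers, like policies or prices, can change.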