Hybrid BM25 + Vector Retrieval for SAP HANA

Vector search finds meaning. BM25 finds exact matches. langchain-hana-retriever gives you both — with Reciprocal Rank Fusion to merge them.

The gap

If you're doing RAG on SAP HANA Cloud, you've got vector search through langchain-hanadb. But no BM25. No keyword ranking. No hybrid retrieval.

That matters more than you think. When someone asks about "article 12" or "HR-LIQUI-2024/0003", semantic search returns vaguely related content. It understands the general topic but misses the exact reference. You need keyword matching for precision and vector search for understanding.

langchain-hana-retriever

I built langchain-hana-retriever to close this gap. It provides two LangChain-compatible retrievers that work directly on your HANA Cloud instance:

HANABm25Retriever

Uses LOCATE on NCLOB columns to fetch keyword candidates, then scores with BM25Okapi in Python. Pure keyword search with proper term frequency scoring.
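
To make that two-step flow concrete, here is a minimal sketch, not the library's actual code. The table and column names, the tokenizer, and the helper names `build_candidate_sql` and `bm25_scores` are all assumptions for illustration, and the IDF uses the non-negative "+1" variant rather than whatever the package's BM25Okapi implementation does internally.

```python
import math
import re
from collections import Counter


def build_candidate_sql(table: str, column: str, n_terms: int) -> str:
    """Build a parameterized query keeping rows that contain any query term.

    Hypothetical table/column names. LOCATE returns 0 when the substring
    is absent, so `> 0` means the term occurs somewhere in the text.
    """
    where = " OR ".join(f'LOCATE("{column}", ?) > 0' for _ in range(n_terms))
    return f'SELECT "{column}" FROM "{table}" WHERE {where}'


def tokenize(text: str) -> list[str]:
    # Keep identifiers such as "HR-LIQUI-2024/0003" together as one token.
    return re.findall(r"\w+(?:[-/]\w+)*", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each candidate document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n_docs if n_docs else 0.0
    # Document frequency: in how many candidates does each term appear?
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = tokenize(query)

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents need more hits to score high.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "Article 12 covers liquidation of travel expenses.",
    "Case HR-LIQUI-2024/0003 was closed in March.",
    "General guidance on expense policy.",
]
scores = bm25_scores("HR-LIQUI-2024/0003", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

The SQL step only narrows the candidate set. The ranking comes from the Python-side scoring, which is why a document that mentions a term once in a long text ends up below a short one that is all about it.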

HANAHybridRetriever

Combines vector and BM25 search, merging results via Reciprocal Rank Fusion. Best of both worlds — semantic understanding plus exact matches.

Installation

pip install langchain-hana-retriever
# For hybrid search
pip install "langchain-hana-retriever[hybrid]"

Why hybrid matters

Vector search finds meaning, BM25 finds exact matches. Hybrid retrieval gives you both, merged into a single ranked list.

In practice, this matters most for enterprise documents. Financial regulations reference specific article numbers. HR documents have internal codes. Contract clauses need exact matching. Pure vector search will get you "in the neighborhood" but can miss the specific reference.

The hybrid retriever runs both searches in parallel, then uses Reciprocal Rank Fusion to merge the ranked results. RRF looks only at each document's position in the two lists, not at the raw scores, so documents that rank highly for both semantic similarity and keyword relevance rise to the top.
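
Here is a rough sketch of that merge step, under a few assumptions: `k=60` is the constant commonly used for RRF (the package may use a different default), and `hybrid_search` with its two search callables is hypothetical, standing in for whatever the retriever does internally.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids.

    Each document gets 1 / (k + rank) from every list it appears in,
    with rank starting at 1; the contributions are summed.
    """
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


def hybrid_search(query, vector_search, keyword_search, k=60):
    """Run both (hypothetical) search callables concurrently, then fuse."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        vector_future = pool.submit(vector_search, query)
        keyword_future = pool.submit(keyword_search, query)
        return reciprocal_rank_fusion(
            [vector_future.result(), keyword_future.result()], k=k
        )


# Stub searchers returning document ids, best match first.
vector_hits = ["d3", "d1", "d7"]
bm25_hits = ["d1", "d9", "d3"]

fused = hybrid_search(
    "article 12",
    vector_search=lambda q: vector_hits,
    keyword_search=lambda q: bm25_hits,
)
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Because only ranks enter the formula, cosine similarities and BM25 scores never have to be put on a common scale. A document that appears in both lists, like `d1` and `d3` here, collects two contributions and outranks one that tops only a single list.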

Key points
  • Drop-in LangChain retriever — works with any LangChain chain or agent
  • No extra infrastructure — runs on your existing HANA Cloud instance
  • Reciprocal Rank Fusion — intelligently merges vector + keyword results
  • BM25Okapi scoring — proper term frequency ranking, not just substring matching
  • MIT licensed — PRs welcome