The gap
If you're doing RAG on SAP HANA Cloud, you've got vector search through
langchain-hanadb. But no BM25. No keyword ranking. No hybrid retrieval.
That matters more than you think. When someone asks about "article 12" or "HR-LIQUI-2024/0003", semantic search returns vaguely related content. It understands the general topic but misses the exact reference. You need keyword matching for precision and vector search for understanding.
langchain-hana-retriever
I built langchain-hana-retriever to close this gap. It provides two
LangChain-compatible retrievers that work directly on your HANA Cloud instance:
- A BM25 retriever: uses LOCATE on NCLOB columns to fetch keyword candidates, then scores them with BM25Okapi in Python. Pure keyword search with proper term frequency scoring.
- A hybrid retriever: combines vector and BM25 search, merging the results via Reciprocal Rank Fusion. You get semantic understanding plus exact matches.
# For hybrid search
pip install "langchain-hana-retriever[hybrid]"
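To make the keyword side concrete, here is a minimal, self-contained sketch of Okapi BM25 scoring over a handful of candidate texts. It is not the library's code: the function names, the whitespace tokenizer, the k1 and b defaults, and the non-negative IDF variant are all assumptions for illustration, and the hard-coded candidate list stands in for rows that a LOCATE filter would return from HANA.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase + whitespace tokenizer, for illustration only.
    return text.lower().split()


def bm25_okapi_scores(query, docs, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document in `docs` for `query`."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    # Average document length; fall back to 1.0 if every document is empty.
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs or 1.0

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    query_terms = tokenize(query)
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            # Assumption: non-negative IDF variant (adds 1 inside the log).
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            # Term frequency saturation with document length normalization.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Hypothetical candidates, standing in for rows returned by the SQL filter.
candidates = [
    "Article 12 covers liquidation rules",
    "General HR policy overview",
    "See article 12 and article 14 for details",
]
scores = bm25_okapi_scores("article 12", candidates)
ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
```

The substring filter keeps the candidate set small, and BM25 then ranks those candidates by how often the query terms appear, discounted for common terms and long documents. A document that never mentions the query terms scores zero.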
Why hybrid matters
Vector search finds meaning; BM25 finds exact matches. Hybrid gives you both, merged into a single ranked list.
In practice, this makes a huge difference for enterprise documents. Financial regulations reference specific article numbers. HR documents have internal codes. Contract clauses need exact matching. Pure vector search will get you "in the neighborhood" but miss the specific reference.
The hybrid retriever runs both searches in parallel, then uses Reciprocal Rank Fusion to merge the ranked results. Documents that score well on both semantic similarity and keyword relevance bubble to the top.
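The fusion step is simple enough to sketch in a few lines. The version below is an illustration under assumptions, not the library's implementation: the function name is made up, the document IDs are hypothetical, and k=60 is the commonly used constant rather than a confirmed default of this package. It covers only the merge, not the parallel execution of the two searches.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranked list.

    Each document earns 1 / (k + rank) from every list it appears in,
    where rank starts at 1. Only ranks are used, so the raw scores of the
    vector and keyword searches never need to be put on a common scale.
    """
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical result lists, best match first.
vector_hits = ["a", "b", "c"]
keyword_hits = ["c", "a", "d"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
fused_ids = [doc_id for doc_id, _ in fused]
```

Here "a" and "c" appear in both lists, so they rise above "b" and "d", which each appear in only one. That is the behavior described above: documents that rank well on both semantic similarity and keyword relevance end up at the top.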
- Drop-in LangChain retriever — works with any LangChain chain or agent
- No extra infrastructure — runs on your existing HANA Cloud instance
- Reciprocal Rank Fusion — intelligently merges vector + keyword results
- BM25Okapi scoring — proper term frequency ranking, not just substring matching
- MIT licensed — PRs welcome