
Saving Costs Using Multimodal Embeddings

A follow-up to One Model, One Vector Space. Multimodal embedding bills get out of hand fast. Image and video tokens cost more than text, vectors are bigger, and every model upgrade multiplies everything. Here are the five savings that cut ours by roughly 10x.

The hangover after the demo

In the previous post I wrote about how multimodal embeddings collapse text, images, and PDFs into one shared vector space. The architecture studio loved it. The retrieval was sharper. The pipeline was simpler. Everyone was happy.

Then we ran the math on rolling it out to ten more tenants and the smile went away. Embedding everything (every screenshot, every floor plan, every page of every spec) adds up fast. Multimodal vectors are bigger. Image and video tokens are pricier than text tokens. Re-embedding when you upgrade the model multiplies the bill all over again.

Multimodal retrieval isn't expensive because the model is bad. It's expensive because you embed too much, too often, and at full resolution.

Below are the five savings we ended up shipping. None of them require a different model. All of them are boring engineering. Together they cut our embedding bill by roughly an order of magnitude without measurably hurting retrieval quality.

1. Skip what you've already embedded

In any real corpus, the same image shows up over and over. The same screenshot is dragged into 40 support tickets. The same floor plan is attached to 12 RFPs. The same product photo is referenced from every catalogue page.

Before any image hits the embedding API, we hash it (plain SHA-256 for exact duplicates, a perceptual hash like pHash for near-duplicates) and check a small cache table: hash → embedding. If we've seen the image before, we reuse the vector and attach it to the new document. The API call never happens.

This sounds obvious right up until you measure it. On the architecture studio's corpus, roughly a third of all images were exact or near-exact duplicates. On a B2B SaaS support corpus we benchmarked later it was over half. That's a flat 30–50% discount before you do anything clever.
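A minimal sketch of the exact-duplicate half of that cache, under stated assumptions: the in-memory dict stands in for the cache table, and `EmbeddingCache`, `embed_fn`, and `fake_embed` are hypothetical names, not part of any provider SDK. Near-duplicate matching would additionally compare perceptual hashes by Hamming distance against a threshold you tune yourself; only the distance helper is shown here.

```python
import hashlib
from typing import Callable, Dict, List

Vector = List[float]


def exact_key(image_bytes: bytes) -> str:
    """SHA-256 hex digest: identical bytes always map to the same key."""
    return hashlib.sha256(image_bytes).hexdigest()


def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Number of differing bits between two integer perceptual hashes."""
    return bin(hash_a ^ hash_b).count("1")


class EmbeddingCache:
    """Hypothetical hash -> embedding cache in front of a paid embedder."""

    def __init__(self, embed_fn: Callable[[bytes], Vector]) -> None:
        self._embed_fn = embed_fn          # the expensive API call
        self._store: Dict[str, Vector] = {}
        self.api_calls = 0                 # counts real embedder calls

    def get_or_embed(self, image_bytes: bytes) -> Vector:
        key = exact_key(image_bytes)
        cached = self._store.get(key)
        if cached is not None:
            return cached                  # cache hit: no API call
        vector = self._embed_fn(image_bytes)
        self.api_calls += 1
        self._store[key] = vector
        return vector


def fake_embed(image_bytes: bytes) -> Vector:
    """Stand-in for a real embedding API call (assumption)."""
    return [float(len(image_bytes))]


cache = EmbeddingCache(fake_embed)
cache.get_or_embed(b"floor-plan")
cache.get_or_embed(b"floor-plan")   # same bytes, served from the cache
print(cache.api_calls)              # prints 1
```

In production the dict would be a database table keyed by hash, and the key should probably include the model name, so that a model upgrade does not return stale vectors.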

2. Summarize video and audio, then embed the text

Native multimodal video embedding is gorgeous and ruinous. You're paying per-frame, per-second, per-token, and you're storing very large vectors. For most retrieval use cases (think "find the meeting where we discussed the Barcelona project") you don't need frame-level fidelity. You need to know what the clip was about.

Native multimodal video
  • MP4 → Multimodal embedding (per-frame) → Large vector(s) per clip → Index & query
  • Accurate, very expensive

Summarize-then-embed
  • MP4 → Cheap LLM transcribes & describes → Text embedding of the summary → Index & query
  • ~100× cheaper, "good enough" for retrieval

We run a cheap model (Gemini Flash works well, and any small captioning model does too) to produce a transcript plus a short visual description, concatenate them, and embed the result with the regular text embedder. The vector lives in the same index as everything else.

Cost drops by roughly two orders of magnitude versus native video embedding. The trade-off is real: you lose the ability to retrieve a specific frame. For "find the meeting" or "find the walkthrough" queries, nobody noticed.
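The summarize-then-embed flow can be sketched roughly as follows. Everything named here is an assumption for illustration: `ClipSummary`, `build_embedding_text`, and `embed_clip` are hypothetical, and the `summarize` and `embed_text` callables stand in for whatever captioning model and text embedder you already use.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ClipSummary:
    transcript: str
    visual_description: str


def build_embedding_text(summary: ClipSummary, max_chars: int = 8000) -> str:
    """Concatenate description and transcript into one text to embed.

    The description goes first so that truncation, if it happens,
    cuts the tail of a long transcript rather than the visual summary.
    """
    sections = []
    if summary.visual_description.strip():
        sections.append("Visual description: " + summary.visual_description.strip())
    if summary.transcript.strip():
        sections.append("Transcript: " + summary.transcript.strip())
    return "\n\n".join(sections)[:max_chars]


def embed_clip(
    path: str,
    summarize: Callable[[str], ClipSummary],
    embed_text: Callable[[str], List[float]],
) -> List[float]:
    """One cheap summarization call, then one ordinary text embedding."""
    summary = summarize(path)
    return embed_text(build_embedding_text(summary))
```

The `max_chars` cap is a crude guard against the embedder's input limit; a real pipeline would more likely count tokens or chunk long transcripts.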

3. Use shorter vectors when full precision isn't needed

Think of an embedding the way you'd think of a photo. The full-resolution version is huge, crisp enough to print on a billboard. A thumbnail is tiny, but it's still recognisable: you can tell at a glance whether it's the right photo. Most of the time, the thumbnail is all you actually need.

Modern embedding models work the same way. They produce a long vector (a list of about 3,000 numbers in the case of Gemini Embedding 2), but they're trained so that the first slice of those numbers is itself a complete, smaller embedding. You can keep just the first 768 numbers and throw the rest away. Same photo, smaller file.

For most search use cases, where the embedding's job is to narrow a few thousand candidates down to a shortlist that a smarter model then ranks, the shorter version is plenty. The index gets four times smaller. Every query gets faster. Storage drops by the same factor. We kept the full-resolution vector around only for the small slice of queries that actually need it.

Rule of thumb
  • Full vector (~3,000 numbers). Use when you need the highest precision, or when the corpus is small enough that storage doesn't matter.
  • Shorter vector (768 numbers). The sensible default for filter-and-rerank pipelines. About 4 times cheaper to store and query.
  • Tiny vector (256 numbers). For edge or on-device use cases where every byte counts.
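Truncation itself is a few lines. The sketch below assumes the model was trained so that a prefix of the vector is usable on its own, as described above. It also re-normalizes the shortened vector, since a prefix of a unit-length vector is generally no longer unit length, which matters for cosine or dot-product search. The 3,072 figure in the storage comparison is an assumption standing in for "about 3,000", and 4 bytes per value assumes float32 storage.

```python
import math
from typing import List, Sequence


def truncate_embedding(vector: Sequence[float], dims: int) -> List[float]:
    """Keep the first `dims` values and rescale to unit length."""
    if dims <= 0 or dims > len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = list(vector[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # all zeros: nothing to rescale
    return [x / norm for x in head]


def storage_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage, ignoring index overhead."""
    return num_vectors * dims * bytes_per_value


full = storage_bytes(1_000_000, 3072)   # 12,288,000,000 bytes
short = storage_bytes(1_000_000, 768)   # 3,072,000,000 bytes
print(full // short)                    # prints 4
```

Check the provider's documentation for which output sizes the model actually supports before picking a cut-off; not every prefix length is trained to work well.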

4. Use batch mode for backfills

Sooner or later you'll need to re-embed everything. A new model comes out. You change the summarization prompt. A tenant migrates in with two years of back-catalogue. Doing this on the live, low-latency endpoint is the most expensive way possible.

Most providers (Gemini, Voyage, OpenAI) offer a batch mode at roughly 50% off, with results promised within about 24 hours. For backfills nobody is waiting on, that's a free halving of the bill. We route any non-interactive job (model upgrades, nightly re-ingests, tenant onboarding) through a low-priority queue that always uses batch.

The only thing to get right is idempotency: a batch job needs to be safely re-runnable, because sooner or later one will fail halfway and you'll want to retry just the missing rows.
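One way to get that property, sketched under assumptions: results are stored keyed by document id, and every run first works out which ids are still missing. `run_backfill` and `submit_batch` are hypothetical names; `submit_batch` stands in for a provider batch call and may return results for only some of the ids it was given.

```python
from typing import Callable, Dict, Iterable, List

Vector = List[float]


def missing_ids(all_ids: Iterable[str], store: Dict[str, Vector]) -> List[str]:
    """Ids that do not have an embedding stored yet."""
    return [doc_id for doc_id in all_ids if doc_id not in store]


def run_backfill(
    all_ids: Iterable[str],
    store: Dict[str, Vector],
    submit_batch: Callable[[List[str]], Dict[str, Vector]],
    chunk_size: int = 100,
) -> int:
    """Submit only the missing ids; return how many were submitted.

    Safe to re-run: rows that already succeeded are skipped, and
    results are upserted by id, so a retry cannot create duplicates.
    """
    todo = missing_ids(all_ids, store)
    for start in range(0, len(todo), chunk_size):
        chunk = todo[start:start + chunk_size]
        results = submit_batch(chunk)   # may be partial on failure
        store.update(results)           # upsert keyed by document id
    return len(todo)
```

After a half-failed run, calling `run_backfill` again with the same arguments submits only the rows that never came back.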

5. Default off, opt in per tenant

The cheapest saving of all is the one that's free: don't turn the feature on for tenants who don't need it.

Most of our tenants don't have a multimodal corpus. They have invoices, POs, and emails. Plain text embeddings cover them perfectly. Defaulting multimodal off, and only switching it on when a customer says "we have tons of diagrams" or "we need to search our photo library", eliminates a huge fraction of the cost we were quietly absorbing.

This is also where the cost story becomes a product story: multimodal goes from a line on the infra bill to a feature flag the sales team can talk about.
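As a rough illustration, the opt-in can be a single flag checked at ingest time. The names below (`TenantConfig`, `choose_pipeline`) and the MIME-type check are assumptions for the sketch, as is the choice to skip non-text content for tenants who have not opted in.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class TenantConfig:
    multimodal_enabled: bool = False  # default off: tenants must opt in


def choose_pipeline(
    tenant_id: str,
    content_type: str,
    configs: Dict[str, TenantConfig],
) -> str:
    """Return which embedding path a document should take."""
    # Unknown tenants fall back to the default (multimodal off).
    config = configs.get(tenant_id, TenantConfig())
    if content_type.startswith("text/"):
        return "text"            # plain text embedding for everyone
    if config.multimodal_enabled:
        return "multimodal"      # only opted-in tenants pay for this
    return "skip"                # assumption: ignore non-text otherwise


configs = {"studio": TenantConfig(multimodal_enabled=True)}
print(choose_pipeline("studio", "image/png", configs))   # prints multimodal
print(choose_pipeline("acme", "image/png", configs))     # prints skip
```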

What we ended up with

The full stack
  • Hash & cache every image before calling the embedder. ~30–50% discount, basically free.
  • Summarize & text-embed video and audio with a cheap captioner. ~100× cheaper than native multimodal video.
  • Truncate to 768 dims for filter-and-rerank. ~4× cheaper storage and query.
  • Batch mode for any backfill or re-ingest. Roughly 50% off the line item.
  • Per-tenant opt-in so you only pay for the tenants who actually benefit.

None of this is novel. None of it required a research paper. It's the kind of thing you only think to do once you've stared at a real bill long enough. The lesson I keep relearning is that the model is rarely the bottleneck. What matters is how much you call it, on what, and whether you cached the answer last time.

Embed less. Embed smaller. Embed cheaper. Embed only when someone asked.

Discussion

Comments live on LinkedIn. Got a sixth saving? Drop it there.
