Multimodal Embeddings:
One Model, One Vector Space for Everything

How we built a unified search engine across photos, floor plans, and PDF specs for an architecture studio — and what it taught me about the future of RAG.

The problem with text-only RAG

If you've built a RAG pipeline, you know the drill. You take your documents, extract the text (OCR if you're unlucky, Docling if you're sophisticated), chunk it, embed it, store the vectors, and at query time you retrieve the relevant chunks and feed them to your LLM.

It works. Until it doesn't.

A client — an architecture studio — came to us with a problem that broke this entire model. They had thousands of photos of completed projects, floor plans in various formats, PDF specifications with complex layouts, and they wanted to search across all of it with natural language. "Show me open-plan kitchens with exposed brick from projects in Barcelona."

Try doing that with text extraction and chunking. You'll lose the visual information that makes the query meaningful in the first place.

Enter multimodal embeddings

One model, one vector space — for text, images, audio, video, and PDFs.

Multimodal embedding models like Voyage Multimodal 3 and Cohere Embed v4 can take any of these input types and project them into a shared vector space. That means a photo of an exposed-brick kitchen and the text "exposed brick kitchen" end up near each other in the embedding space.

This changes the architecture fundamentally. Instead of building separate pipelines for each modality, you get one corpus, one index, one query.
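To make "near each other in the embedding space" concrete, here is a toy sketch. The vectors are hand-picked stand-ins, not real model output (a production system would get them from a model such as Voyage Multimodal 3 or Cohere Embed v4), but the geometry is the same: closeness is just cosine similarity in the shared space.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings, hand-picked for illustration (NOT model output):
photo_of_brick_kitchen = [0.9, 0.1, 0.8]   # image input
text_brick_kitchen     = [0.8, 0.2, 0.7]   # text input, same concept
text_parking_garage    = [0.1, 0.9, 0.1]   # unrelated concept

# The photo sits closer to the matching text than to the unrelated one:
assert cosine_similarity(photo_of_brick_kitchen, text_brick_kitchen) > \
       cosine_similarity(photo_of_brick_kitchen, text_parking_garage)
```

The whole point of a shared space is that this comparison works regardless of which modality produced which vector.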

Two pipelines, compared

Text-based RAG:
PDF → text extraction (OCR / Docling) → chunking → text embedding → retrieve chunks → LLM generates answer.

Multimodal pipeline:
text, PDF, image, audio, video → multimodal embedding → unified vector index → retrieve mixed results → LLM generates answer.

Fewer steps, richer results.
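The "one corpus, one index, one query" idea fits in a few lines. This is a minimal sketch, assuming stubbed embedding vectors: a real pipeline would call a multimodal model to embed the actual files, and would use a proper vector database rather than a brute-force list. The names and toy vectors here are illustrative, not from the project.

```python
import math
from dataclasses import dataclass

@dataclass
class Item:
    ref: str        # file path or URL of the original asset
    modality: str   # "image", "pdf_page", "text", ...
    vector: list    # embedding in the shared space

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class UnifiedIndex:
    """One index for every modality; search is a single nearest-neighbour pass."""
    def __init__(self):
        self.items = []

    def add(self, ref, modality, vector):
        self.items.append(Item(ref, modality, vector))

    def search(self, query_vector, k=3):
        ranked = sorted(self.items, key=lambda it: cosine(query_vector, it.vector),
                        reverse=True)
        return ranked[:k]

# Usage with toy vectors (a real system would embed the files themselves):
index = UnifiedIndex()
index.add("kitchen_photo.jpg", "image", [0.9, 0.1, 0.8])
index.add("floor_plan.pdf#p3", "pdf_page", [0.2, 0.8, 0.3])

query = [0.8, 0.2, 0.7]  # stand-in for the embedding of "exposed brick kitchen"
top = index.search(query, k=1)
```

Note that nothing in the index cares which modality an item came from; that indifference is exactly what the unified pipeline buys you.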

The nuances nobody talks about

Before you rip out your existing RAG pipeline and go full multimodal, there are some important nuances I learned the hard way during this project:

Generation still needs text

Multimodal embeddings solve retrieval, not generation. When your LLM needs to synthesize an answer, it still works with text. So you're not eliminating text processing entirely — you're moving it downstream. The retrieval step becomes richer, but the generation step still requires extracting text from whatever you retrieved.

Granularity is different

Text-based RAG gives you chunk-level precision. You can retrieve the exact paragraph that answers a question. Multimodal embeddings typically work at page level or document level. For the architecture studio this was fine — they wanted to find the right photo or the right page. But for a legal document where the answer lives in a specific clause? You might still want text chunking.

Costs shift, they don't disappear

You save on text extraction and chunking complexity. But multimodal embedding models are more expensive per token than text-only models, and the vectors are often larger (meaning more storage and slower retrieval at scale). The economics work differently, not necessarily better.

When to use what

After this project, my mental model is straightforward:

Key takeaways
  • Mixed media corpus (photos, PDFs, plans) — go multimodal. Text-only RAG literally can't see your images.
  • Complex layouts (tables, diagrams, multi-column PDFs) — multimodal handles these better than OCR + chunking.
  • Pure text documents with fine-grained Q&A needs — text-based RAG still wins on precision and cost.
  • Hybrid approach — use multimodal for retrieval, then extract text from retrieved results for generation. Best of both worlds.
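The hybrid approach in the last bullet amounts to inserting a per-modality text-extraction step between retrieval and generation. A minimal sketch, with placeholder extractors: in a real system the image branch would call a captioning or OCR model and the PDF branch a proper text extractor, and the retrieved items would come from the multimodal index rather than being hard-coded.

```python
def extract_text(result):
    """Turn one retrieved item into text the LLM can consume."""
    modality, payload = result
    if modality == "text":
        return payload
    if modality == "image":
        # Placeholder: a real system would run captioning/OCR on the image.
        return f"[description of image {payload}]"
    if modality == "pdf_page":
        # Placeholder: a real system would extract the page's text.
        return f"[extracted text of {payload}]"
    raise ValueError(f"unsupported modality: {modality}")

def build_prompt(question, retrieved):
    """Assemble a text-only generation prompt from mixed retrieval results."""
    context = "\n\n".join(extract_text(r) for r in retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}"

retrieved = [("image", "kitchen_photo.jpg"), ("pdf_page", "specs.pdf#p12")]
prompt = build_prompt("Which kitchens use exposed brick?", retrieved)
```

Retrieval stays multimodal and page-level; only the handful of retrieved items pay the text-extraction cost, which is what makes the economics of the hybrid workable.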

The bigger picture

What excites me most isn't the technology itself — it's what it enables. The architecture studio can now onboard a new designer and tell them: "just search for what you need." No tagging, no folder structure, no tribal knowledge about where things are stored.

Everything in one corpus. One index, one query.

That's a fundamentally different way of working with information. And it's just the beginning — as these models improve, the line between "searching your files" and "talking to your files" will keep blurring.