Multimodal Embeddings:
One Model, One Vector Space for Everything

How we built a unified search engine across photos, floor plans, and PDF specs for an architecture studio — and what it taught me about the future of RAG.

The problem with text-only RAG

If you've built a RAG pipeline, you know the drill. You take your documents, extract the text (OCR if you're unlucky, Docling if you're sophisticated), chunk it, embed it, store the vectors, and at query time you retrieve the relevant chunks and feed them to your LLM.

It works. Until it doesn't.

A client — an architecture studio — came to us with a problem that broke this entire model. They had thousands of photos of completed projects, floor plans in various formats, PDF specifications with complex layouts, and they wanted to search across all of it with natural language. "Show me open-plan kitchens with exposed brick from projects in Barcelona."

Try doing that with text extraction and chunking. You'll lose the visual information that makes the query meaningful in the first place.

Enter multimodal embeddings

One model, one vector space — for text, images, audio, video, and PDFs.

Multimodal embedding models like Voyage Multimodal 3 and Cohere Embed v4 can take any of these input types and project them into a shared vector space. That means a photo of an exposed-brick kitchen and the text "exposed brick kitchen" end up near each other in the embedding space.

This changes the architecture fundamentally. Instead of building separate pipelines for each modality, you get one corpus, one index, one query.
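To make "near each other in the embedding space" concrete, here is a toy sketch. The vectors are hand-picked stand-ins, not real model output (a production system would get them from a model such as Voyage Multimodal 3 or Cohere Embed v4), but the geometry is the same: closeness is just cosine similarity in the shared space.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings, hand-picked for illustration (NOT model output):
photo_of_brick_kitchen = [0.9, 0.1, 0.8]   # image input
text_brick_kitchen     = [0.8, 0.2, 0.7]   # text input, same concept
text_parking_garage    = [0.1, 0.9, 0.1]   # unrelated concept

# The photo sits closer to the matching text than to the unrelated one:
assert cosine_similarity(photo_of_brick_kitchen, text_brick_kitchen) > \
       cosine_similarity(photo_of_brick_kitchen, text_parking_garage)
```

The whole point of a shared space is that this comparison works regardless of which modality produced which vector.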

Two pipelines, compared

Text-based RAG:
PDF → text extraction (OCR / Docling) → chunking → text embedding → retrieve chunks → LLM generates answer.

Multimodal pipeline:
text, PDF, image, audio, video → multimodal embedding → unified vector index → retrieve mixed results → LLM generates answer.

Fewer steps, richer results.
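The "one corpus, one index, one query" idea fits in a few lines. This is a minimal sketch, assuming stubbed embedding vectors: a real pipeline would call a multimodal model to embed the actual files, and would use a proper vector database rather than a brute-force list. The names and toy vectors here are illustrative, not from the project.

```python
import math
from dataclasses import dataclass

@dataclass
class Item:
    ref: str        # file path or URL of the original asset
    modality: str   # "image", "pdf_page", "text", ...
    vector: list    # embedding in the shared space

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class UnifiedIndex:
    """One index for every modality; search is a single nearest-neighbour pass."""
    def __init__(self):
        self.items = []

    def add(self, ref, modality, vector):
        self.items.append(Item(ref, modality, vector))

    def search(self, query_vector, k=3):
        ranked = sorted(self.items, key=lambda it: cosine(query_vector, it.vector),
                        reverse=True)
        return ranked[:k]

# Usage with toy vectors (a real system would embed the files themselves):
index = UnifiedIndex()
index.add("kitchen_photo.jpg", "image", [0.9, 0.1, 0.8])
index.add("floor_plan.pdf#p3", "pdf_page", [0.2, 0.8, 0.3])

query = [0.8, 0.2, 0.7]  # stand-in for the embedding of "exposed brick kitchen"
top = index.search(query, k=1)
```

Note that nothing in the index cares which modality an item came from; that indifference is exactly what the unified pipeline buys you.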

The nuances nobody talks about

Before you rip out your existing RAG pipeline and go full multimodal, there are some important nuances I learned the hard way during this project:

Generation still needs text

Multimodal embeddings solve retrieval, not generation. When your LLM needs to synthesize an answer, it still works with text. So you're not eliminating text processing entirely — you're moving it downstream. The retrieval step becomes richer, but the generation step still requires extracting text from whatever you retrieved.

Granularity is different

Text-based RAG gives you chunk-level precision. You can retrieve the exact paragraph that answers a question. Multimodal embeddings typically work at page level or document level. For the architecture studio this was fine — they wanted to find the right photo or the right page. But for a legal document where the answer lives in a specific clause? You might still want text chunking.

Costs shift, they don't disappear

You save on text extraction and chunking complexity. But multimodal embedding models are more expensive per token than text-only models, and the vectors are often larger (meaning more storage and slower retrieval at scale). The economics work differently, not necessarily better.

When to use what

After this project, my mental model is straightforward:

Key takeaways
  • Mixed media corpus (photos, PDFs, plans) — go multimodal. Text-only RAG literally can't see your images.
  • Complex layouts (tables, diagrams, multi-column PDFs) — multimodal handles these better than OCR + chunking.
  • Pure text documents with fine-grained Q&A needs — text-based RAG still wins on precision and cost.
  • Hybrid approach — use multimodal for retrieval, then extract text from retrieved results for generation. Best of both worlds.
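The hybrid approach in the last bullet amounts to inserting a per-modality text-extraction step between retrieval and generation. A minimal sketch, with placeholder extractors: in a real system the image branch would call a captioning or OCR model and the PDF branch a proper text extractor, and the retrieved items would come from the multimodal index rather than being hard-coded.

```python
def extract_text(result):
    """Turn one retrieved item into text the LLM can consume."""
    modality, payload = result
    if modality == "text":
        return payload
    if modality == "image":
        # Placeholder: a real system would run captioning/OCR on the image.
        return f"[description of image {payload}]"
    if modality == "pdf_page":
        # Placeholder: a real system would extract the page's text.
        return f"[extracted text of {payload}]"
    raise ValueError(f"unsupported modality: {modality}")

def build_prompt(question, retrieved):
    """Assemble a text-only generation prompt from mixed retrieval results."""
    context = "\n\n".join(extract_text(r) for r in retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}"

retrieved = [("image", "kitchen_photo.jpg"), ("pdf_page", "specs.pdf#p12")]
prompt = build_prompt("Which kitchens use exposed brick?", retrieved)
```

Retrieval stays multimodal and page-level; only the handful of retrieved items pay the text-extraction cost, which is what makes the economics of the hybrid workable.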

The bigger picture

What excites me most isn't the technology itself — it's what it enables. The architecture studio can now onboard a new designer and tell them: "just search for what you need." No tagging, no folder structure, no tribal knowledge about where things are stored.

Everything in one corpus. One index, one query.

That's a fundamentally different way of working with information. And it's just the beginning — as these models improve, the line between "searching your files" and "talking to your files" will keep blurring.