
Why everything we know about vector search is wrong

Author: Arun Kumar

In February, someone contacted me on LinkedIn asking for a consultation on their RAG pipeline. Their fancy new ColBERT search system—the one that crushed benchmarks in testing—was melting their servers. Query latency had jumped from 50ms to 5 seconds. “But the accuracy is so much better!” they kept saying, like a mantra, while customers fled to competitors. I suggested a few changes, and we got the system back to a sensible balance of speed and accuracy.

That moment stuck with me. Plenty of companies have RAG in production, and almost all of them struggle with the same problem: fast, efficient retrieval.

In this article, we’ll discuss something nobody tells you about multi-vector search: it’s simultaneously the best and worst thing to happen to information retrieval. Google recently figured out how to fix it, and we’ll explore their approach as well.

How we got here

Remember when BERT first dropped? We all got drunk on the idea of shoving entire documents into a single vector. Take your text, feed it through BERT, grab the [CLS] token, and boom—you’ve got a dense representation.

Simple

Fast

Wrong


What’s wild is how badly this approach fails once you understand what’s happening. You’re taking War and Peace and saying “this book is about… uh… Russia?” It’s like describing the Mona Lisa as “a painting with a person.”

The math behind BERT’s similarity is deceptively simple:

score(q, d) = q · d = Σᵢ qᵢdᵢ

Just a dot product. Fast, but lossy as hell.
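
To make that concrete, here is a minimal numpy sketch of single-vector scoring; the corpus size and dimensions are made up, but the point stands: ranking the whole corpus is one matrix-vector product.

```python
import numpy as np

# Toy corpus: one 768-dim [CLS] embedding per document (sizes are illustrative).
corpus = np.random.randn(100_000, 768).astype(np.float32)
query = np.random.randn(768).astype(np.float32)

# score(q, d) = q . d for every document is a single matrix-vector product.
scores = corpus @ query            # shape: (100_000,)
top10 = np.argsort(-scores)[:10]   # indices of the 10 best-scoring documents
```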

The problems hit you in production:

    • User searches for “python async debugging tips”
    • BERT vector matches documents about snakes, programming languages, and British comedy troupes equally well
    • The actual debugging tips? Buried on page 3
    • Your boss asks why the search sucks

So along comes ColBERT with a radical idea: keep ALL the token embeddings. Every. Single. One. Instead of one vector per document, you get hundreds. The similarity function becomes:

score(Q, D) = Σᵢ₌₁ⁿ maxⱼ₌₁ᵐ (qᵢ · dⱼ)

Where Q has n query tokens and D has m document tokens. It’s beautiful. It’s accurate.

It’s also computational suicide.
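
Here is a rough sketch of what MaxSim scoring looks like, to show where the cost comes from (the token counts and dimensions below are just illustrative):

```python
import numpy as np

def maxsim(Q: np.ndarray, D: np.ndarray) -> float:
    """ColBERT-style MaxSim: for each query token, take its best match
    among the document tokens, then sum over query tokens.
    Q: (n, dim) query token embeddings; D: (m, dim) document token embeddings."""
    sim = Q @ D.T                  # (n, m) matrix of token-level dot products
    return sim.max(axis=1).sum()

# Toy example: 32 query tokens, 130 document tokens, 128-dim token embeddings.
Q = np.random.randn(32, 128).astype(np.float32)
D = np.random.randn(130, 128).astype(np.float32)
print(maxsim(Q, D))  # one full (32 x 130) matrix product per document scored
```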


The dirty secret of multi-vector search

Here’s what the AI influencers don’t emphasize: ColBERT-style models are absolute monsters in production. The maths doesn’t lie. Single-vector search uses dot products—your CPU can do billions per second. Multi-vector search uses MaxSim scoring: for each query token, find its maximum similarity across all document tokens, then sum everything up. That’s not a dot product; that’s a matrix operation. For every. Single. Document.

The storage explodes. A million documents with BERT? That’s maybe 4GB of vectors. With ColBERT? Try 400GB. Hope you’ve got deep pockets for memory.
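
Those numbers are back-of-envelope, but easy to reproduce. Assuming float32, 768-dim embeddings, and roughly 130 tokens per document (production ColBERT setups usually project tokens down to 128 dims and quantize, which shrinks this, but the shape of the problem is the same):

```python
docs = 1_000_000
single_vector = docs * 768 * 4           # one 768-dim float32 vector per doc
multi_vector  = docs * 130 * 768 * 4     # one vector per token, ~130 tokens/doc
print(f"single-vector index: {single_vector / 1e9:.1f} GB")   # ~3 GB
print(f"multi-vector index:  {multi_vector / 1e9:.1f} GB")    # ~400 GB
```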

And you lose all the speed tricks.

Those fancy approximate nearest neighbour algorithms that make vector search fast? They assume you’re comparing single points. Multi-vector similarity is non-linear. You’re back to brute force, like it’s 1999.

The “solutions” are embarrassing. PLAID tried to fix this by clustering tokens and building inverted indices. Still 10x slower than single-vector. Others use a two-stage approach: filter with BERT, re-rank with ColBERT. Except the documents ColBERT excels at finding are precisely the ones BERT filters out. Oops.
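
For reference, the two-stage pattern looks roughly like this (a hedged sketch, not anyone’s actual implementation; the candidate count is arbitrary). The flaw is visible in the structure itself: anything the single-vector filter drops in stage one can never be rescued in stage two.

```python
import numpy as np

def maxsim(Q, D):
    # Sum over query tokens of each token's best match among the document tokens.
    return (Q @ D.T).max(axis=1).sum()

def two_stage_search(query_vec, query_tokens, corpus_vecs, corpus_tokens,
                     k=10, n_candidates=1000):
    # Stage 1: cheap single-vector filter over the whole corpus.
    candidates = np.argsort(-(corpus_vecs @ query_vec))[:n_candidates]
    # Stage 2: exact MaxSim rerank of the survivors only.
    reranked = sorted(candidates,
                      key=lambda i: maxsim(query_tokens, corpus_tokens[i]),
                      reverse=True)
    return reranked[:k]
```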

Google offered us MUVERA

In June 2025, Google dropped MUVERA, and it’s one of those ideas that makes you angry you didn’t think of it first.

The insight is stupid simple: what if we could transform multi-vector sets into single vectors that preserve similarity relationships?

Not by averaging (the naive first instinct), but through a principled mathematical transformation.


Here’s the trick:

  • Take your high-dimensional space and randomly chop it up with hyperplanes
  • For each chunk of space, allocate some dimensions in your output vector
  • For documents: average the tokens that fall in each chunk
  • For queries: sum the tokens in each chunk
  • The inner product of these “Fixed Dimensional Encodings” approximates the original multi-vector similarity

The mathematical foundation is elegant. Given a partition function π that maps vectors to partition indices:

FDE_q[i] = Σ_{v∈Q, π(v)=i} v

FDE_d[i] = (1/|{v∈D : π(v)=i}|) × Σ_{v∈D, π(v)=i} v

And magically:

⟨FDE_q, FDE_d⟩ ≈ Chamfer(Q, D)
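
Here is a toy, single-repetition sketch of the idea in numpy. The real MUVERA implementation adds multiple repetitions, inner projections, and a step that fills empty clusters; the hyperplane count and dimensions here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_PLANES = 128, 4                          # 2**4 = 16 partitions of the space
planes = rng.standard_normal((N_PLANES, DIM))

def partition(tokens):
    # SimHash: a token's bucket is the bit pattern of which side of each
    # random hyperplane it lands on.
    bits = (tokens @ planes.T) > 0                        # (n_tokens, N_PLANES)
    return bits.astype(int) @ (1 << np.arange(N_PLANES))  # bucket id in [0, 16)

def fde(tokens, is_query):
    # One DIM-sized block per bucket: queries SUM their tokens per bucket,
    # documents AVERAGE theirs, exactly as in the formulas above.
    out = np.zeros((2 ** N_PLANES, DIM), dtype=np.float32)
    buckets = partition(tokens)
    for b in range(2 ** N_PLANES):
        members = tokens[buckets == b]
        if len(members):
            out[b] = members.sum(0) if is_query else members.mean(0)
    return out.ravel()                                    # a single 2048-dim vector

# The plain inner product of the two FDEs approximates the multi-vector score.
Q = np.random.randn(32, DIM).astype(np.float32)
D = np.random.randn(130, DIM).astype(np.float32)
approx_score = fde(Q, is_query=True) @ fde(D, is_query=False)
```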

What makes this genius is what it DOESN’T do. It doesn’t require training. Doesn’t need to see your data. Doesn’t care if you’re searching emails or academic papers. It just… works.

Real numbers from real systems

Let me break down what MUVERA actually delivers, because the paper undersells how game-changing this is:

Traditional BERT setup:

• 50ms average query latency
• 70% recall@10 on BEIR benchmarks
• 4GB index for 1M docs
• Works with any vector database

ColBERT (with PLAID optimizations):

• 500ms average latency (on a good day)
• 85% recall@10
• 400GB index
• Requires custom infrastructure

MUVERA:

• 55ms average latency
• 83% recall@10
• 5GB index (or 150MB with compression!)
• Works with standard MIPS systems

The punchline? MUVERA is getting 95% of ColBERT’s accuracy gains at 10% of the computational cost. It’s not magic—it’s just good math.
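
And because FDEs are just single vectors, they drop straight into off-the-shelf MIPS tooling. Here is a hedged sketch using FAISS; `corpus_token_lists` and `query_tokens` are hypothetical token-embedding arrays, and `fde` is the toy encoder from the sketch above:

```python
import faiss
import numpy as np

# Encode every document once, offline, with the toy fde() sketched earlier.
doc_fdes = np.stack([fde(toks, is_query=False) for toks in corpus_token_lists])

index = faiss.IndexFlatIP(doc_fdes.shape[1])   # exact inner-product search;
index.add(doc_fdes)                            # swap in an ANN index for speed

# At query time: one FDE, one standard MIPS lookup, optional exact rerank.
q_fde = fde(query_tokens, is_query=True).reshape(1, -1)
scores, ids = index.search(q_fde, 100)         # cheap candidate retrieval
# Optionally rerank these 100 candidates with exact MaxSim for the final order.
```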

Why this changes everything (and what it means for you)

What gets me excited is the meta-lesson here. We’ve been thinking about this problem wrong. Instead of building increasingly complex infrastructure for new model architectures, maybe we should be building better bridges between new models and existing systems.

The theoretical guarantee is what seals the deal. MUVERA proves that:

|⟨FDE_q, FDE_d⟩ – Chamfer(Q, D)| ≤ ε × Chamfer(Q, D)

With high probability for appropriate FDE dimensions. This isn’t hand-waving—it’s a provable approximation bound.
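
You can sanity-check the approximation yourself by comparing the exact Chamfer score and the FDE inner product for the same pair (using the toy `fde` and `maxsim` sketches above; with so few partitions and a single repetition, expect the error to be looser than a production configuration):

```python
# Exact Chamfer / MaxSim score vs. the FDE approximation for one (Q, D) pair.
exact = maxsim(Q, D)
approx = fde(Q, is_query=True) @ fde(D, is_query=False)
rel_error = abs(approx - exact) / abs(exact)
print(f"exact={exact:.2f}  approx={approx:.2f}  relative error={rel_error:.2%}")
```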

The alternatives people are chasing—custom silicon for matrix operations, learned indices, hybrid CPU/GPU systems—they’re all solving the wrong problem. MUVERA proves you can have your cake and eat it too, without reinventing the entire kitchen.

(Yes, there are edge cases where you need full ColBERT accuracy. If you’re Google Search, you probably run the full model. For the other 99.9% of us trying to build decent search? MUVERA can help.)

The future isn’t about choosing between simple-but-inaccurate or complex-but-slow. It’s about clever transformations that give us the best of both worlds. Sometimes the best engineering isn’t about building new systems—it’s about making existing systems do things they were never designed for.   

References & bibliography:

  1. MUVERA Paper
    • Title: “MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings”
    • Link: https://arxiv.org/abs/2405.19504
    • Authors include Majid Hadian, Jason Lee, and Vahab Mirrokni
  2. ColBERT
  3. PLAID
  4. BEIR Benchmarks
  5. Probabilistic Tree Embeddings

Technical concepts referenced

  1. BERT – For single-vector retrieval using [CLS] token embeddings
  2. MaxSim (Maximum Similarity) – The scoring function used in ColBERT
  3. Chamfer Similarity – The multi-vector similarity measure
  4. MIPS (Maximum Inner Product Search) – The optimization technique for vector search
  5. Product Quantization – For compressing FDEs

Tools/systems mentioned:

  1. FAISS, ScaNN – Vector database systems
  2. Pinecone, Weaviate, Qdrant – Commercial vector databases
  3. MS MARCO – Dataset where ColBERT shows 50-70% improvement
