
Why everything we know about vector search is wrong

Author: Arun Kumar

In February, someone contacted me on LinkedIn asking for a consultation on their RAG pipeline. Their fancy new ColBERT search system—the one that crushed benchmarks in testing—was melting their servers. Query latency had jumped from 50ms to 5 seconds. “But the accuracy is so much better!” they kept saying, like a mantra, while customers fled to competitors. I suggested a few changes, and we got the system back to a sensible balance of speed and accuracy.

That moment stuck with me. Plenty of companies have RAG in production, and almost all of them struggle with the same problem: fast, efficient retrieval.

In this article, we’ll discuss something nobody tells you about multi-vector search: it’s simultaneously the best and worst thing to happen to information retrieval. Google recently figured out how to fix it, and we’ll explore their approach as well.

How we got here

Remember when BERT first dropped? We all got drunk on the idea of shoving entire documents into a single vector. Take your text, feed it through BERT, grab the [CLS] token, and boom—you’ve got a dense representation.

Simple

Fast

Wrong


What’s wild is how badly this approach fails once you understand what’s happening. You’re taking War and Peace and saying “this book is about… uh… Russia?” It’s like describing the Mona Lisa as “a painting with a person.”

The math behind BERT’s similarity is deceptively simple:

score(q, d) = q · d = Σᵢ qᵢdᵢ

Just a dot product. Fast, but lossy as hell.
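
To make that concrete, here is a minimal numpy sketch of single-vector scoring; the corpus size and dimensions are made up, but the point stands: ranking the whole corpus is one matrix-vector product.

```python
import numpy as np

# Toy corpus: one 768-dim [CLS] embedding per document (sizes are illustrative).
corpus = np.random.randn(100_000, 768).astype(np.float32)
query = np.random.randn(768).astype(np.float32)

# score(q, d) = q . d for every document is a single matrix-vector product.
scores = corpus @ query            # shape: (100_000,)
top10 = np.argsort(-scores)[:10]   # indices of the 10 best-scoring documents
```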

The problems hit you in production:

    • User searches for “python async debugging tips”
    • BERT vector matches documents about snakes, programming languages, and British comedy troupes equally well
    • The actual debugging tips? Buried on page 3
    • Your boss asks why the search sucks

So along comes ColBERT with a radical idea: keep ALL the token embeddings. Every. Single. One. Instead of one vector per document, you get hundreds. The similarity function becomes:

score(Q, D) = Σᵢ₌₁ⁿ maxⱼ₌₁ᵐ (qᵢ · dⱼ)

Where Q has n query tokens and D has m document tokens. It’s beautiful. It’s accurate.

It’s also computational suicide.
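
Here is a rough sketch of what MaxSim scoring looks like, to show where the cost comes from (the token counts and dimensions below are just illustrative):

```python
import numpy as np

def maxsim(Q: np.ndarray, D: np.ndarray) -> float:
    """ColBERT-style MaxSim: for each query token, take its best match
    among the document tokens, then sum over query tokens.
    Q: (n, dim) query token embeddings; D: (m, dim) document token embeddings."""
    sim = Q @ D.T                  # (n, m) matrix of token-level dot products
    return sim.max(axis=1).sum()

# Toy example: 32 query tokens, 130 document tokens, 128-dim token embeddings.
Q = np.random.randn(32, 128).astype(np.float32)
D = np.random.randn(130, 128).astype(np.float32)
print(maxsim(Q, D))  # one full (32 x 130) matrix product per document scored
```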


The dirty secret of multi-vector search

Here’s what the AI influencers don’t emphasize: ColBERT-style models are absolute monsters in production. The maths doesn’t lie. Single-vector search uses dot products—your CPU can do billions per second. Multi-vector search uses MaxSim scoring: for each query token, find its maximum similarity across all document tokens, then sum everything up. That’s not a dot product; that’s a matrix operation. For every. Single. Document.

The storage explodes. A million documents with BERT? That’s maybe 4GB of vectors. With ColBERT? Try 400GB. Hope you’ve got deep pockets for memory.
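
Those numbers are back-of-envelope, but easy to reproduce. Assuming float32, 768-dim embeddings, and roughly 130 tokens per document (production ColBERT setups usually project tokens down to 128 dims and quantize, which shrinks this, but the shape of the problem is the same):

```python
docs = 1_000_000
single_vector = docs * 768 * 4           # one 768-dim float32 vector per doc
multi_vector  = docs * 130 * 768 * 4     # one vector per token, ~130 tokens/doc
print(f"single-vector index: {single_vector / 1e9:.1f} GB")   # ~3 GB
print(f"multi-vector index:  {multi_vector / 1e9:.1f} GB")    # ~400 GB
```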

And you lose all the speed tricks.

Those fancy approximate nearest neighbour algorithms that make vector search fast? They assume you’re comparing single points. Multi-vector similarity is non-linear. You’re back to brute force, like it’s 1999.

The “solutions” are embarrassing. PLAID tried to fix this by clustering tokens and building inverted indices. Still 10x slower than single-vector. Others use a two-stage approach: filter with BERT, re-rank with ColBERT. Except the documents ColBERT excels at finding are precisely the ones BERT filters out. Oops.
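
For reference, the two-stage pattern looks roughly like this (a hedged sketch, not anyone’s actual implementation; the candidate count is arbitrary). The flaw is visible in the structure itself: anything the single-vector filter drops in stage one can never be rescued in stage two.

```python
import numpy as np

def maxsim(Q, D):
    # Sum over query tokens of each token's best match among the document tokens.
    return (Q @ D.T).max(axis=1).sum()

def two_stage_search(query_vec, query_tokens, corpus_vecs, corpus_tokens,
                     k=10, n_candidates=1000):
    # Stage 1: cheap single-vector filter over the whole corpus.
    candidates = np.argsort(-(corpus_vecs @ query_vec))[:n_candidates]
    # Stage 2: exact MaxSim rerank of the survivors only.
    reranked = sorted(candidates,
                      key=lambda i: maxsim(query_tokens, corpus_tokens[i]),
                      reverse=True)
    return reranked[:k]
```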

Google offered us MUVERA

In June 2025, Google dropped MUVERA, and it’s one of those ideas that makes you angry you didn’t think of it first.

The insight is stupid simple: what if we could transform multi-vector sets into single vectors that preserve similarity relationships?

Not by averaging (the naive first instinct), but through a principled mathematical transformation.


Here’s the trick:

  • Take your high-dimensional space and randomly chop it up with hyperplanes
  • For each chunk of space, allocate some dimensions in your output vector
  • For documents: average the tokens that fall in each chunk
  • For queries: sum the tokens in each chunk
  • The inner product of these “Fixed Dimensional Encodings” approximates the original multi-vector similarity

The mathematical foundation is elegant. Given a partition function π that maps vectors to partition indices:

FDE_q[i] = Σ_{v∈Q, π(v)=i} v

FDE_d[i] = (1/|{v∈D : π(v)=i}|) × Σ_{v∈D, π(v)=i} v

And magically:

⟨FDE_q, FDE_d⟩ ≈ Chamfer(Q, D)
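
Here is a toy, single-repetition sketch of the idea in numpy. The real MUVERA implementation adds multiple repetitions, inner projections, and a step that fills empty clusters; the hyperplane count and dimensions here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_PLANES = 128, 4                          # 2**4 = 16 partitions of the space
planes = rng.standard_normal((N_PLANES, DIM))

def partition(tokens):
    # SimHash: a token's bucket is the bit pattern of which side of each
    # random hyperplane it lands on.
    bits = (tokens @ planes.T) > 0                        # (n_tokens, N_PLANES)
    return bits.astype(int) @ (1 << np.arange(N_PLANES))  # bucket id in [0, 16)

def fde(tokens, is_query):
    # One DIM-sized block per bucket: queries SUM their tokens per bucket,
    # documents AVERAGE theirs, exactly as in the formulas above.
    out = np.zeros((2 ** N_PLANES, DIM), dtype=np.float32)
    buckets = partition(tokens)
    for b in range(2 ** N_PLANES):
        members = tokens[buckets == b]
        if len(members):
            out[b] = members.sum(0) if is_query else members.mean(0)
    return out.ravel()                                    # a single 2048-dim vector

# The plain inner product of the two FDEs approximates the multi-vector score.
Q = np.random.randn(32, DIM).astype(np.float32)
D = np.random.randn(130, DIM).astype(np.float32)
approx_score = fde(Q, is_query=True) @ fde(D, is_query=False)
```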

What makes this genius is what it DOESN’T do. It doesn’t require training. Doesn’t need to see your data. Doesn’t care if you’re searching emails or academic papers. It just… works.

Real numbers from real systems

Let me break down what MUVERA actually delivers, because the paper undersells how game-changing this is:

Traditional BERT setup:

• 50ms average query latency
• 70% recall@10 on BEIR benchmarks
• 4GB index for 1M docs
• Works with any vector database

ColBERT (with PLAID optimizations):

• 500ms average latency (on a good day)
• 85% recall@10
• 400GB index
• Requires custom infrastructure

MUVERA:

• 55ms average latency
• 83% recall@10
• 5GB index (or 150MB with compression!)
• Works with standard MIPS systems

The punchline? MUVERA is getting 95% of ColBERT’s accuracy gains at 10% of the computational cost. It’s not magic—it’s just good math.
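
And because FDEs are just single vectors, they drop straight into off-the-shelf MIPS tooling. Here is a hedged sketch using FAISS; `corpus_token_lists` and `query_tokens` are hypothetical token-embedding arrays, and `fde` is the toy encoder from the sketch above:

```python
import faiss
import numpy as np

# Encode every document once, offline, with the toy fde() sketched earlier.
doc_fdes = np.stack([fde(toks, is_query=False) for toks in corpus_token_lists])

index = faiss.IndexFlatIP(doc_fdes.shape[1])   # exact inner-product search;
index.add(doc_fdes)                            # swap in an ANN index for speed

# At query time: one FDE, one standard MIPS lookup, optional exact rerank.
q_fde = fde(query_tokens, is_query=True).reshape(1, -1)
scores, ids = index.search(q_fde, 100)         # cheap candidate retrieval
# Optionally rerank these 100 candidates with exact MaxSim for the final order.
```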

Why this changes everything (and what it means for you)

What gets me excited is the meta-lesson here. We’ve been thinking about this problem wrong. Instead of building increasingly complex infrastructure for new model architectures, maybe we should be building better bridges between new models and existing systems.

The theoretical guarantee is what seals the deal. MUVERA proves that:

|⟨FDE_q, FDE_d⟩ – Chamfer(Q, D)| ≤ ε × Chamfer(Q, D)

With high probability for appropriate FDE dimensions. This isn’t hand-waving—it’s a provable approximation bound.
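
You can sanity-check the approximation yourself by comparing the exact Chamfer score and the FDE inner product for the same pair (using the toy `fde` and `maxsim` sketches above; with so few partitions and a single repetition, expect the error to be looser than a production configuration):

```python
# Exact Chamfer / MaxSim score vs. the FDE approximation for one (Q, D) pair.
exact = maxsim(Q, D)
approx = fde(Q, is_query=True) @ fde(D, is_query=False)
rel_error = abs(approx - exact) / abs(exact)
print(f"exact={exact:.2f}  approx={approx:.2f}  relative error={rel_error:.2%}")
```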

The alternatives people are chasing—custom silicon for matrix operations, learned indices, hybrid CPU/GPU systems—they’re all solving the wrong problem. MUVERA proves you can have your cake and eat it too, without reinventing the entire kitchen.

(Yes, there are edge cases where you need full ColBERT accuracy. If you’re Google Search, you probably run the full model. For the other 99.9% of us trying to build decent search? MUVERA can help.)

The future isn’t about choosing between simple-but-inaccurate or complex-but-slow. It’s about clever transformations that give us the best of both worlds. Sometimes the best engineering isn’t about building new systems—it’s about making existing systems do things they were never designed for.   

References & bibliography:

  1. MUVERA Paper
    • Title: “MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings”
    • Link: https://arxiv.org/abs/2405.19504
    • Authors include Majid Hadian, Jason Lee, and Vahab Mirrokni
  2. ColBERT
  3. PLAID
  4. BEIR Benchmarks
  5. Probabilistic Tree Embeddings

Technical concepts referenced

  1. BERT – For single-vector retrieval using [CLS] token embeddings
  2. MaxSim (Maximum Similarity) – The scoring function used in ColBERT
  3. Chamfer Similarity – The multi-vector similarity measure
  4. MIPS (Maximum Inner Product Search) – The optimization technique for vector search
  5. Product Quantization – For compressing FDEs

Tools/systems mentioned:

  1. FAISS, ScaNN – Vector database systems
  2. Pinecone, Weaviate, Qdrant – Commercial vector databases
  3. MS MARCO – Dataset where ColBERT shows 50-70% improvement
