The Mathematics of Vector Retrieval: From Euclidean Distance to Approximate Nearest Neighbors

A rigorous exploration of the mathematical foundations powering modern vector search systems, including distance metrics, similarity functions, and ANN algorithms that enable semantic search at scale.

Gaurav Malhotra
January 21, 2024 · 15 min read
Python · FAISS · NumPy

Vector retrieval lies at the heart of modern AI systems, from semantic search engines to recommendation systems and retrieval-augmented generation (RAG) pipelines. Understanding the mathematical structures that power these systems is essential for any ML practitioner. This deep dive explores the rigorous mathematics behind vector search, drawing from the Vectors project which demonstrates these principles in practice.

The Vector Retrieval Problem: A Formal Definition

At its core, vector retrieval is an optimization problem. Given a collection of N vectors in a d-dimensional space:

X = {x_1, x_2, ..., x_N} ⊂ ℝ^d

and a query vector q ∈ ℝ^d, we seek to find the k vectors from X that are "closest" to q according to some distance function Δ.

Formally, we solve:

argmin(u ∈ X) Δ(q, u)_(k)

where the subscript (k) denotes selection of the k vectors minimizing the distance. This elegant formulation encompasses everything from k-nearest neighbors to maximum inner product search.
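To make the formulation concrete, here is a minimal brute-force sketch in NumPy (array sizes and random data are purely illustrative); it returns the indices of the k vectors closest to q under squared L2 distance:

import numpy as np

def exact_knn(q: np.ndarray, X: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k rows of X closest to q under squared L2 distance."""
    dists = np.sum((X - q) ** 2, axis=1)          # squared distances, shape (N,)
    top_k = np.argpartition(dists, k)[:k]         # k smallest, unordered, O(N)
    return top_k[np.argsort(dists[top_k])]        # rank the k survivors

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 128)).astype(np.float32)
q = rng.normal(size=128).astype(np.float32)
print(exact_knn(q, X, k=5))

Every approximate method discussed below is ultimately judged against this exact baseline.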

Vector Representations: Encoding Semantics in Geometry

Real-world objects---documents, images, audio---are transformed into dense vector representations where geometric proximity encodes semantic similarity. Each dimension captures a learned feature, and the value represents that feature's strength.

Consider a document vector where each dimension corresponds to semantic concepts. Two documents discussing similar topics will occupy nearby regions in this high-dimensional space, enabling retrieval through geometric operations.

[Figure: RAG architecture diagram]

Distance Functions: The Mathematical Heart of Retrieval

The choice of distance function Δ fundamentally shapes retrieval behavior. Each metric captures different notions of similarity, and understanding their mathematical properties is crucial for system design.

Euclidean Distance (L2 Norm)

The most intuitive metric, Euclidean distance measures the straight-line distance between two points in ℝ^d:

Δ_Euclidean(u, v) = ||u - v||_2 = √(Σ(i=1 to d) (u_i - v_i)²)

Properties:

  • Metric space axioms: Satisfies non-negativity, identity of indiscernibles, symmetry, and triangle inequality
  • Computational note: Often we minimize ||u - v||_2² to avoid the square root computation
  • Geometric interpretation: Measures the magnitude of the difference vector u - v

The squared Euclidean distance can be expanded:

||u - v||_2² = ||u||_2² + ||v||_2² - 2⟨u, v⟩

This expansion reveals the relationship between L2 distance and inner products, which becomes important for optimization.
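It is also the trick behind fast batched implementations: the cross term ⟨q, x⟩ for every query-database pair is a single matrix multiplication. A short NumPy sketch (shapes are illustrative):

import numpy as np

def pairwise_sq_l2(Q: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Squared L2 distances via ||q||² + ||x||² - 2⟨q, x⟩; returns shape (m, N)."""
    q_norms = np.sum(Q ** 2, axis=1, keepdims=True)   # (m, 1)
    x_norms = np.sum(X ** 2, axis=1)                  # (N,)
    sq = q_norms + x_norms - 2.0 * (Q @ X.T)          # the cross term is one matmul
    return np.maximum(sq, 0.0)                        # clamp tiny negative rounding errors

rng = np.random.default_rng(1)
Q, X = rng.normal(size=(4, 64)), rng.normal(size=(1_000, 64))
reference = ((Q[:, None, :] - X[None, :, :]) ** 2).sum(-1)
assert np.allclose(pairwise_sq_l2(Q, X), reference)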

Cosine Similarity and Angular Distance

When vector magnitudes are irrelevant and only direction matters, cosine similarity provides a normalized measure:

cos_sim(u, v) = ⟨u, v⟩ / (||u||_2 · ||v||_2) = cos(θ)

where θ is the angle between vectors. The corresponding angular distance is:

Δ_Angular(u, v) = 1 - ⟨u, v⟩ / (||u||_2 · ||v||_2) = 1 - cos(θ)

Properties:

  • Range: cos_sim ∈ [-1, 1], with 1 indicating identical directions
  • Scale invariance: cos_sim(αu, βv) = cos_sim(u, v) for all α, β > 0
  • Unit sphere equivalence: For normalized vectors ||û|| = ||v̂|| = 1, cosine similarity equals the inner product

Important relationship: For unit-normalized vectors:

||u - v||_2² = 2(1 - cos(θ)) = 2 · Δ_Angular(u, v)

This means L2 distance on normalized vectors is monotonically related to angular distance.
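A quick numerical check of this identity on random vectors after unit-normalization:

import numpy as np

rng = np.random.default_rng(2)
u, v = rng.normal(size=256), rng.normal(size=256)
u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)   # unit-normalize both vectors

cos_theta = np.dot(u, v)
sq_l2 = np.sum((u - v) ** 2)
assert np.isclose(sq_l2, 2.0 * (1.0 - cos_theta))     # ||u - v||² = 2(1 - cos θ)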

Inner Product (Dot Product)

The inner product provides an unnormalized similarity measure:

⟨u, v⟩ = Σ(i=1 to d) u_i · v_i = ||u||_2 · ||v||_2 · cos(θ)

For retrieval, we define:

Δ_IP(u, v) = -⟨u, v⟩

The negation converts the maximization problem (Maximum Inner Product Search, MIPS) into minimization.

Properties:

  • Not a metric: Violates triangle inequality and non-negativity
  • Magnitude-sensitive: Unlike cosine, considers both direction and magnitude
  • Application: Used when vector norms carry meaning (e.g., item popularity in recommendations)

Metric Comparison Summary

Metric        | Formula    | Normalized? | Metric Space? | Primary Use Case
Euclidean     | ||u-v||_2  | No          | Yes           | kNN, clustering
Cosine        | 1 - cos(θ) | Yes         | No*           | Text similarity
Inner Product | -⟨u,v⟩     | No          | No            | MIPS, recommendations

*Angular distance satisfies metric properties when converted appropriately.

The Curse of Dimensionality

In high-dimensional spaces, exact nearest neighbor search becomes computationally intractable. This phenomenon, known as the curse of dimensionality, manifests in several ways:

  1. Distance concentration: All pairwise distances converge to similar values as d → ∞
  2. Exponential volume growth: The volume of a ball of radius r scales as r^d, so maintaining a fixed sampling density requires exponentially more data as d grows
  3. Search complexity: Exact algorithms degrade to linear scan complexity

The concentration of distances can be formalized. For random vectors, as dimensionality increases:

(max(u ∈ X) ||q - u||_2 - min(u ∈ X) ||q - u||_2) / min(u ∈ X) ||q - u||_2 → 0
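This concentration is easy to observe empirically. A small simulation (sample sizes and dimensions chosen arbitrarily) that measures the relative contrast above for i.i.d. Gaussian data:

import numpy as np

rng = np.random.default_rng(3)
for d in (2, 16, 128, 1024):
    X = rng.normal(size=(5_000, d))
    q = rng.normal(size=d)
    dists = np.linalg.norm(X - q, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative contrast = {contrast:.3f}")
# The contrast shrinks steadily as d grows: "near" and "far" become hard to distinguish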

This motivates the transition from exact to approximate methods.

Approximate Nearest Neighbors (ANN)

Given the computational barriers to exact search, we accept controlled approximation. An algorithm provides (1+ε)-approximate nearest neighbors if:

Δ(q, u) ≤ (1 + ε) · Δ(q, u*)

where u is the returned neighbor, u* is the true nearest neighbor, and ε > 0 is the approximation factor.


Locality-Sensitive Hashing (LSH)

LSH provides probabilistic guarantees through hash functions that preserve locality:

P[h(u) = h(v)] = f(Δ(u, v))

where f is monotonically decreasing. Similar vectors hash to the same bucket with high probability.

For cosine similarity, random hyperplane LSH uses:

h_r(u) = sign(⟨r, u⟩)

where r is a random unit vector. The collision probability is:

P[h_r(u) = h_r(v)] = 1 - θ(u,v)/π
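A compact NumPy sketch of random hyperplane LSH, comparing the empirical bit-collision rate against the 1 - θ/π prediction (the number of hyperplanes is arbitrary):

import numpy as np

def simhash_signatures(X: np.ndarray, n_planes: int, seed: int = 0) -> np.ndarray:
    """One sign bit per random hyperplane; all rows of X share the same hyperplanes."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(X.shape[1], n_planes))   # hyperplane normal vectors
    return (X @ R) >= 0                           # boolean signature matrix (N, n_planes)

rng = np.random.default_rng(4)
u, v = rng.normal(size=64), rng.normal(size=64)
theta = np.arccos(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sig = simhash_signatures(np.stack([u, v]), n_planes=10_000)
empirical = np.mean(sig[0] == sig[1])             # fraction of agreeing bits
print(f"empirical collision rate {empirical:.3f} vs theoretical {1 - theta / np.pi:.3f}")

Concatenating several bits forms a bucket key; repeating the construction with independent hash tables trades memory for recall.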

Hierarchical Navigable Small World (HNSW)

HNSW constructs a multi-layer graph where:

  • Upper layers contain long-range connections (coarse navigation)
  • Lower layers contain local connections (fine search)

Search complexity is O(log N) with high probability, achieved through the small-world property of the graph structure.

Product Quantization (PQ)

PQ compresses vectors by decomposing ℝ^d into M subspaces:

u = [u^(1), u^(2), ..., u^(M)]

Each subspace u^(m) ∈ ℝ^(d/M) is quantized to its nearest centroid from a learned codebook. Distance computation uses precomputed lookup tables:

||q - u||_2² ≈ Σ(m=1 to M) ||q^(m) - c^(m)_(k_m)||_2²

where c^(m)_(k_m) is the codebook centroid assigned to subvector u^(m).
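The encoding and asymmetric distance computation can be sketched in a few lines of NumPy. The codebooks below are random rather than learned with k-means, so accuracy is not representative; the point is the table-lookup structure:

import numpy as np

d, M, ks = 64, 8, 256           # dimension, subspaces, centroids per codebook
ds = d // M                     # subspace dimension d/M
rng = np.random.default_rng(5)

# In practice each codebook is learned with k-means on the corresponding subvectors
codebooks = rng.normal(size=(M, ks, ds))
X = rng.normal(size=(5_000, d))

# Encode: store, for each subvector, the index of its nearest centroid (one byte each)
codes = np.empty((X.shape[0], M), dtype=np.uint8)
for m in range(M):
    sub = X[:, m * ds:(m + 1) * ds]
    d2 = ((sub[:, None, :] - codebooks[m][None, :, :]) ** 2).sum(-1)
    codes[:, m] = d2.argmin(axis=1)

# Query: one distance table per subspace, then sum the table entries selected by each code
q = rng.normal(size=d)
tables = np.stack([((q[m * ds:(m + 1) * ds] - codebooks[m]) ** 2).sum(-1)
                   for m in range(M)])               # shape (M, ks)
approx_d2 = tables[np.arange(M), codes].sum(axis=1)  # asymmetric distances, shape (N,)
print(approx_d2.argsort()[:10])                      # approximate top-10 neighbors

Each vector is stored in M bytes instead of 4d bytes, and each query-to-vector distance costs M table lookups instead of d multiplications.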

Inverted File Index (IVF)

IVF partitions the vector space into Voronoi cells using k-means clustering. At query time, only vectors in nearby cells are searched:

  1. Find the n_probe nearest cluster centroids to q
  2. Search only vectors assigned to those clusters
  3. Return top-k from the reduced candidate set

Facebook AI Similarity Search (FAISS) implements these algorithms efficiently for billion-scale datasets. Key index types include:

import faiss
import numpy as np

d = 128  # Dimension
n = 1000000  # Database size
k = 10  # Number of neighbors

# Generate sample data
xb = np.random.random((n, d)).astype('float32')
xq = np.random.random((100, d)).astype('float32')

# Exact search (brute force); also serves as ground truth for measuring recall
index_flat = faiss.IndexFlatL2(d)
index_flat.add(xb)
gt_distances, gt_indices = index_flat.search(xq, k)

# IVF with PQ compression
nlist = 100  # Number of clusters
m = 8  # Number of subquantizers
quantizer = faiss.IndexFlatL2(d)  # coarse quantizer over the nlist Voronoi cells
index_ivfpq = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)  # 8 bits per sub-code
index_ivfpq.train(xb)
index_ivfpq.add(xb)
index_ivfpq.nprobe = 10  # Search 10 clusters

# HNSW
index_hnsw = faiss.IndexHNSWFlat(d, 32)
index_hnsw.add(xb)

# Query
distances, indices = index_ivfpq.search(xq, k)

Index Selection Guide

Index Type | Memory        | Build Time | Query Time    | Recall | Use Case
Flat       | O(Nd)         | O(N)       | O(N)          | 100%   | Small datasets, ground truth
IVF        | O(Nd)         | O(N)       | O(N/√n_list)  | 95%+   | Medium datasets
IVFPQ      | O(NM·log₂ k)  | O(Nd)      | O(N/√n_list)  | 90%+   | Large datasets, memory constrained
HNSW       | O(Nd + NM)    | O(N·log N) | O(log N)      | 98%+   | Low latency requirements

Practical Application: S3 Metadata Classification

The Vectors project demonstrates these principles through a practical example: classifying S3 objects using metadata vectors.

S3 object metadata (bucket name, object key, size, timestamps) is encoded as vectors in ℝ^d. A classifier f: ℝ^d → Δ^(C-1) maps these vectors to probability distributions over classes:

f(x) = [P(y = Sensitive | x), P(y = Public | x), P(y = Archival | x)]

The distance between metadata vectors captures semantic similarity:

Δ(x_financial, x_logs) > Δ(x_financial, x_reports)

This enables automated data governance through learned representations.
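A toy end-to-end sketch of the idea (the encoder below is a crude feature-hashing stand-in, not the project's actual embedding; the bucket names, keys, and labels are made up for illustration):

import numpy as np

def encode_metadata(bucket: str, key: str, size_bytes: int, d: int = 64) -> np.ndarray:
    """Toy feature-hashing encoder: tokens from bucket/key hash into d slots."""
    vec = np.zeros(d)
    for token in (bucket + "/" + key).replace("-", "/").split("/"):
        vec[hash(token) % d] += 1.0
    vec[0] += np.log1p(size_bytes) / 20.0          # crude size feature
    return vec / (np.linalg.norm(vec) + 1e-12)     # unit-normalize

# Hypothetical labeled examples and a query object
labeled = [
    (encode_metadata("corp-finance", "reports/q3-earnings.pdf", 2_000_000), "Sensitive"),
    (encode_metadata("public-site",  "assets/logo.png",            40_000), "Public"),
    (encode_metadata("infra-logs",   "alb/2023/12/access.log",  9_000_000), "Archival"),
]
q = encode_metadata("corp-finance", "reports/q4-forecast.pdf", 1_500_000)
sims = [float(q @ x) for x, _ in labeled]
print(labeled[int(np.argmax(sims))][1])            # nearest-neighbor label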

Mathematical Properties for System Design

When designing vector retrieval systems, consider these mathematical properties:

1. Metric vs. Non-Metric Spaces

True metrics enable powerful data structures (metric trees, ball trees). The triangle inequality allows pruning:

Δ(q, v) ≥ |Δ(q, u) - Δ(u, v)|

Non-metrics like inner product require different index structures.
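A small sketch of how this pruning plays out: with distances to a pivot precomputed, the bound gives a lower bound on Δ(q, v) for every candidate v without touching v itself (the low dimension and the search radius here are chosen purely for illustration):

import numpy as np

# Low-dimensional example; pruning is most effective where distances are well spread out
rng = np.random.default_rng(6)
d = 2
q, pivot = rng.normal(size=d), rng.normal(size=d)
candidates = rng.normal(size=(1_000, d))

dist_q_pivot = np.linalg.norm(q - pivot)
dist_pivot_v = np.linalg.norm(candidates - pivot, axis=1)   # precomputed offline
lower_bound = np.abs(dist_q_pivot - dist_pivot_v)           # Δ(q, v) ≥ |Δ(q, pivot) - Δ(pivot, v)|

radius = 0.25                                               # current best distance in a running k-NN search
prunable = lower_bound > radius                             # these candidates cannot beat the current best
print(f"pruned {prunable.sum()} of {len(candidates)} candidates without computing Δ(q, v)")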

2. Dimensionality Reduction

The Johnson-Lindenstrauss lemma guarantees that for any ε ∈ (0, 1) and any set of n points, there exists a mapping f: ℝ^d → ℝ^k where:

k = O(log(n) / ε²)

such that for all points u, v:

(1-ε)||u-v||_2² ≤ ||f(u)-f(v)||_2² ≤ (1+ε)||u-v||_2²

This enables significant dimensionality reduction while preserving distances.
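A quick empirical check with a random Gaussian projection, the standard JL construction (sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(7)
n, d, k = 2_000, 4_096, 512
X = rng.normal(size=(n, d))

P = rng.normal(size=(d, k)) / np.sqrt(k)   # random Gaussian projection, scaled by 1/√k
Y = X @ P

i, j = rng.integers(0, n, size=(2, 1_000))
mask = i != j                              # skip degenerate zero-distance pairs
orig = np.linalg.norm(X[i[mask]] - X[j[mask]], axis=1)
proj = np.linalg.norm(Y[i[mask]] - Y[j[mask]], axis=1)
ratio = proj / orig
print(f"distance ratios: mean {ratio.mean():.3f}, min {ratio.min():.3f}, max {ratio.max():.3f}")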

3. Normalization Strategies

Pre-normalizing vectors to unit length:

  • Converts L2 distance to angular distance (up to a monotonic transformation)
  • Converts inner product search to cosine similarity search
  • Enables efficient L2- or inner-product-optimized indices to serve cosine-similarity queries (see the FAISS sketch below)
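A short FAISS sketch of this strategy (sizes are arbitrary): normalize both database and query vectors in place, then an inner-product index returns cosine-similarity rankings.

import faiss
import numpy as np

d = 128
xb = np.random.random((100_000, d)).astype('float32')
xq = np.random.random((10, d)).astype('float32')

# In-place L2 normalization; afterwards inner product equals cosine similarity
faiss.normalize_L2(xb)
faiss.normalize_L2(xq)

index = faiss.IndexFlatIP(d)      # exact inner-product (here: cosine) search
index.add(xb)
sims, ids = index.search(xq, 5)   # similarities in descending order, plus indices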

Conclusion

The mathematics of vector retrieval provides a principled foundation for building semantic search systems. From the formal definition of the retrieval problem to the trade-offs between exact and approximate methods, understanding these mathematical structures enables informed system design.

Key takeaways:

  1. Distance function choice fundamentally affects retrieval behavior
  2. Dimensionality curse necessitates approximate methods at scale
  3. ANN algorithms trade accuracy for efficiency with theoretical guarantees
  4. FAISS provides production-ready implementations of these algorithms

The Vectors project provides hands-on demonstrations of these concepts, including a complete example of metadata-based classification using vector representations.

References

  1. Johnson, W., Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a Hilbert space.
  2. Indyk, P., Motwani, R. (1998). Approximate nearest neighbors: towards removing the curse of dimensionality.
  3. Malkov, Y., Yashunin, D. (2018). Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs.
  4. Jegou, H., Douze, M., Schmid, C. (2011). Product quantization for nearest neighbor search.
  5. FAISS Wiki - Facebook AI Similarity Search documentation.