Machine learning and high dimensional vector search

Matthijs Douze

TL;DR

This paper investigates why machine learning (ML) has had limited impact on high-dimensional vector search despite parallel progress in both fields. It examines traditional information retrieval with ML, the embedding-based two-tower paradigm, and the equivalence of vector search to a linear classifier under dot-product similarity, highlighting both the barriers and the avenues that ML could open, such as distribution modeling and learned quantization. It also surveys how vector search can benefit ML tasks (e.g., network compression and long-context attention) and discusses the limits of applying ML directly at scale, including hardware considerations and break-even points. The paper concludes that ML should augment rather than replace vector search data structures: it advocates ML-guided index construction (e.g., graph-based indexes) and the use of vector search to improve ML workloads, with practical implications for scalable retrieval, compression, and attention in large-scale systems.

Abstract

Machine learning and vector search are two research topics that developed in parallel in nearby communities. However, unlike many other fields related to big data, machine learning has not significantly impacted vector search. In this opinion paper we attempt to explain this oddity. Along the way, we wander over the numerous bridges between the two fields.

Paper Structure

This paper contains 16 sections and 4 equations.