Vector Search with OpenAI Embeddings: Lucene Is All You Need
Jimmy Lin, Ronak Pradeep, Tommaso Teofili, Jasper Xian
TL;DR
The paper argues that dedicated vector stores are not necessary for dense retrieval and that existing Lucene-based infrastructure suffices. It demonstrates an end-to-end vector search pipeline in which OpenAI ada2 embeddings are indexed with HNSW in Lucene (via Anserini) on the MS MARCO passage corpus. The results show competitive effectiveness on the MS MARCO development queries and the TREC Deep Learning (DL) tracks, with no encoder training required, which makes the approach a practical, production-friendly alternative to bespoke vector databases. The authors also discuss implementation challenges and outline ecosystem improvements that could further improve performance and adoption.
Abstract
We provide a reproducible, end-to-end demonstration of vector search with OpenAI embeddings using Lucene on the popular MS MARCO passage ranking test collection. The main goal of our work is to challenge the prevailing narrative that a dedicated vector store is necessary to take advantage of recent advances in deep neural networks as applied to search. Quite the contrary, we show that hierarchical navigable small-world network (HNSW) indexes in Lucene are adequate to provide vector search capabilities in a standard bi-encoder architecture. This suggests that, from a simple cost-benefit analysis, there does not appear to be a compelling reason to introduce a dedicated vector store into a modern "AI stack" for search, since such applications have already received substantial investments in existing, widely deployed infrastructure.
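To make the bi-encoder setup concrete, the following is a minimal sketch of the retrieval step, not the paper's Lucene/Anserini implementation. Documents and queries are encoded independently into vectors, and retrieval returns the documents whose vectors have the highest inner product with the query vector. The sketch performs exhaustive (exact) search over toy 3-dimensional vectors; an HNSW index approximates the same ranking while scoring far fewer documents. The function names (`normalize`, `exact_top_k`) and the toy data are assumptions made for illustration only.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def inner_product(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def exact_top_k(query_vec, doc_vecs, k):
    """Return the k (doc_id, score) pairs with the highest inner product.

    This is exhaustive (brute-force) search. An HNSW index approximates
    the same ranking while scoring only a small subset of the documents.
    """
    scored = (
        (doc_id, inner_product(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy vectors standing in for real embeddings (hypothetical data).
doc_vecs = {
    "d1": normalize([1.0, 0.0, 0.0]),
    "d2": normalize([0.0, 1.0, 0.0]),
    "d3": normalize([1.0, 1.0, 0.0]),
}
query_vec = normalize([1.0, 0.2, 0.0])

top = exact_top_k(query_vec, doc_vecs, k=2)
for doc_id, score in top:
    print(f"{doc_id}\t{score:.4f}")
```

In a real deployment, the vectors would come from the embedding model, and the exhaustive scan would be replaced by an approximate nearest-neighbor index such as HNSW, trading a small amount of recall for much lower query latency.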
