Vector Search with OpenAI Embeddings: Lucene Is All You Need

Jimmy Lin, Ronak Pradeep, Tommaso Teofili, Jasper Xian

TL;DR

The paper argues that dedicated vector stores are not necessary for dense retrieval and that existing Lucene-based infrastructure suffices. It demonstrates an end-to-end vector search pipeline in which OpenAI ada2 embeddings are indexed with HNSW in Lucene (via Anserini) on the MS MARCO passage corpus. The results show competitive effectiveness on the MS MARCO development queries and the DL tracks without any encoder training, which points to a practical, production-friendly alternative to bespoke vector databases. The authors also discuss implementation challenges and outline ecosystem improvements that could further improve performance and adoption.

Abstract

We provide a reproducible, end-to-end demonstration of vector search with OpenAI embeddings using Lucene on the popular MS MARCO passage ranking test collection. The main goal of our work is to challenge the prevailing narrative that a dedicated vector store is necessary to take advantage of recent advances in deep neural networks as applied to search. Quite the contrary, we show that hierarchical navigable small-world network (HNSW) indexes in Lucene are adequate to provide vector search capabilities in a standard bi-encoder architecture. This suggests that, from a simple cost-benefit analysis, there does not appear to be a compelling reason to introduce a dedicated vector store into a modern "AI stack" for search, since such applications have already received substantial investments in existing, widely deployed infrastructure.

Paper Structure (8 sections, 1 figure, 1 table)

This paper contains 8 sections, 1 figure, 1 table.

Figures (1)

  • Figure 1: A standard bi-encoder architecture, where encoders generate dense vector representations (embeddings) from queries and documents (passages). Retrieval is framed as $k$-nearest neighbor search in vector space.
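
To make the retrieval step in Figure 1 concrete, the following is a minimal sketch of $k$-nearest neighbor search over dense vectors, scored by inner product. It is an exact brute-force scan and not the HNSW index the paper uses in Lucene; the function names (`dot`, `knn_search`), the passage ids, and the three-dimensional toy vectors are all hypothetical and stand in for real query and passage embeddings produced by the encoders.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def knn_search(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k passages with the highest inner product to the query.

    This scans every passage (exact search). An HNSW index would instead
    approximate this result by traversing a graph over the vectors.
    """
    scored = ((doc_id, dot(query_vec, vec)) for doc_id, vec in doc_vecs.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy embeddings (assumed values, for illustration only).
passages = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.1, 0.9, 0.0],
    "p3": [0.7, 0.7, 0.1],
}
query = [1.0, 0.2, 0.0]

top2 = knn_search(query, passages, k=2)
for doc_id, score in top2:
    print(doc_id, round(score, 3))
```

In a bi-encoder setup the passage vectors are computed once at indexing time and only the query is encoded at search time, which is why the scoring function can be this simple. The exact scan here costs time linear in the number of passages; approximate indexes such as HNSW trade a small amount of recall for much faster lookups on large collections.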