Efficient Data Access Paths for Mixed Vector-Relational Search
Viktor Sanca, Anastasia Ailamaki
TL;DR
The paper tackles efficient data access for mixed vector-relational search in in-memory systems, where vector indexes speed up similarity search but complicate relational filtering. It presents two complementary access paths: an exact, exhaustive scan with relational predicate pushdown and a batched tensor formulation, and an approximate index-based probe (e.g., HNSW) augmented with relational filtering. Benchmarks across data dimensionality, selectivity, and batch size show a cross-over point where scan-based and index-based costs intersect, so the optimal choice is workload-driven rather than static. Because the two paths can differ in performance by orders of magnitude, the findings highlight the importance of adaptively selecting between scan and probe. The paper also discusses implications of hardware trends such as AMX and high-bandwidth memory for vector-relational data management. Overall, the work provides practical guidelines for end-to-end vector-relational query planning, indicating when to favor exhaustive scans, batched tensor computations, or approximate indexing depending on the workload and hardware context.
Abstract
The rapid growth of machine learning capabilities and the adoption of data processing methods using vector embeddings have sparked great interest in creating systems for vector data management. While the predominant approach to vector data management is to use specialized index structures for fast search over the entirety of the vector embeddings, once combined with other (meta)data, the search queries can also become selective on relational attributes, as is typical for analytical queries. As using vector indexes differs from traditional relational data access, we revisit and analyze alternative access paths for efficient mixed vector-relational search. We first evaluate the accurate but exhaustive scan-based search and propose hardware optimizations, an alternative tensor-based formulation, and batching to offset the cost. We outline the complex access-path design space, primarily driven by relational selectivity, and the decisions to consider when choosing between an exhaustive scan-based search and an approximate index-based approach. Since the vector index primarily avoids expensive computation across the entire dataset, contrary to common relational knowledge, it is better to scan at lower selectivity and probe at higher, with a cross-over point between the two approaches dictated by data dimensionality and the number of concurrent search queries.
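
The scan-based path described above can be illustrated with a minimal sketch, which is not the authors' implementation. It evaluates the relational predicate first (pushdown), then scores all queries against the qualifying rows as one batched score matrix, and finally takes the exact top-k per query. The function name `filtered_batched_scan`, the dictionary-based attributes, and the use of inner product as the similarity measure are assumptions made for illustration. A real system would replace the nested loops with a single matrix multiplication in a tensor library.

```python
import heapq


def filtered_batched_scan(vectors, attrs, predicate, queries, k):
    """Exact top-k by inner product over rows passing a relational predicate.

    Hypothetical helper for illustration; names and the similarity measure
    (inner product) are assumptions, not taken from the paper.
    """
    # Predicate pushdown: filter on relational attributes first, so that
    # similarity is computed only for qualifying rows.
    selected = [i for i, a in enumerate(attrs) if predicate(a)]

    # Batched formulation: one (num_queries x num_selected) score matrix,
    # conceptually Q @ V_selected^T.
    scores = [
        [sum(qj * vj for qj, vj in zip(q, vectors[i])) for i in selected]
        for q in queries
    ]

    # Exact per-query top-k, mapped back to original row ids.
    results = []
    for row in scores:
        top = heapq.nlargest(k, zip(row, selected))
        results.append([row_id for _, row_id in top])
    return results


# Toy data: four 2-dimensional embeddings with one relational attribute each.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.5, 0.5]]
attrs = [{"year": 2019}, {"year": 2021}, {"year": 2022}, {"year": 2023}]
queries = [[1.0, 0.0], [0.0, 1.0]]

result = filtered_batched_scan(
    vectors, attrs, lambda a: a["year"] >= 2021, queries, k=2
)
print(result)
```

In this toy input, row 0 is the closest match for the first query but fails the predicate, so it never reaches the similarity computation. The cost of the scan therefore shrinks with the number of qualifying rows, which is the property that makes the scan competitive with an index probe in part of the selectivity range.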
