PDX: A Data Layout for Vector Similarity Search

Leonardo Kuffo, Elena Krippner, Peter Boncz

TL;DR

k-nearest-neighbor search (KNNS), also called vector similarity search (VSS), is computationally intensive on high-dimensional vectors. This motivates PDX, a vertical data layout that stores vectors in blocks to enable dimension-by-dimension distance calculations and pruning. The authors introduce PDXearch for exact and approximate pruning within IVF-like indexes, and PDX-BOND as an exact DCO optimizer that leverages query-aware dimension access without preprocessing. Across 10 datasets and 4 CPU architectures, auto-vectorized PDX distance kernels outperform traditional horizontal layouts, and PDX-based pruning methods (ADSampling/BSA) achieve 2–7x speedups, with PDX-BOND delivering 2.5x–6x faster exact searches than FAISS/USearch/Milvus in many cases. The work provides open-source C++ implementations and outlines practical implications for vector databases, including update-friendly designs and potential GPU extensions.

Abstract

We propose Partition Dimensions Across (PDX), a data layout for vectors (e.g., embeddings) that, similar to PAX [6], stores multiple vectors in one block, using a vertical layout for the dimensions (Figure 1). PDX accelerates exact and approximate similarity search thanks to its dimension-by-dimension search strategy that operates on multiple-vectors-at-a-time in tight loops. It beats SIMD-optimized distance kernels on standard horizontal vector storage (avg 40% faster), relying only on scalar code that gets auto-vectorized. We combined the PDX layout with the recent dimension-pruning algorithms ADSampling [19] and BSA [52], which accelerate approximate vector search. We found that these algorithms on the horizontal vector layout can lose to SIMD-optimized linear scans, even if they are SIMD-optimized. However, when used on PDX, their benefit is restored to 2-7x. We find that search on PDX is especially fast if only a limited number of dimensions has to be scanned fully, which is what the dimension-pruning approaches do. We finally introduce PDX-BOND, an even more flexible dimension-pruning strategy, with good performance on exact search and reasonable performance on approximate search. Unlike previous pruning algorithms, it can work on vector data "as-is" without preprocessing, making it attractive for vector databases with frequent updates.
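
The contrast between the horizontal layout and a vertical, PDX-style block can be illustrated with a minimal C++ sketch. This is not the authors' implementation: the function names (`l2_horizontal`, `l2_vertical`, `to_vertical`), the index formula `block[d * n + v]`, and the choice of squared Euclidean distance are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Horizontal (N-ary) layout: all dimensions of one vector are contiguous,
// i.e. data[v * dims + d]. Each vector needs its own accumulator, so the
// inner loop carries a dependency on acc.
std::vector<float> l2_horizontal(const std::vector<float>& data,
                                 const std::vector<float>& query,
                                 std::size_t n, std::size_t dims) {
    assert(data.size() == n * dims && query.size() == dims);
    std::vector<float> dist(n, 0.0f);
    for (std::size_t v = 0; v < n; ++v) {
        float acc = 0.0f;
        for (std::size_t d = 0; d < dims; ++d) {
            const float diff = data[v * dims + d] - query[d];
            acc += diff * diff;
        }
        dist[v] = acc;
    }
    return dist;
}

// Transpose a horizontal array into one vertical block (assumed layout):
// block[d * n + v] holds dimension d of vector v.
std::vector<float> to_vertical(const std::vector<float>& data,
                               std::size_t n, std::size_t dims) {
    assert(data.size() == n * dims);
    std::vector<float> block(n * dims);
    for (std::size_t v = 0; v < n; ++v)
        for (std::size_t d = 0; d < dims; ++d)
            block[d * n + v] = data[v * dims + d];
    return block;
}

// Vertical layout: the outer loop walks dimensions, the inner loop updates
// the partial distance of every vector in the block. The inner loop is plain
// scalar code with independent iterations, which a compiler may auto-vectorize.
std::vector<float> l2_vertical(const std::vector<float>& block,
                               const std::vector<float>& query,
                               std::size_t n, std::size_t dims) {
    assert(block.size() == n * dims && query.size() == dims);
    std::vector<float> dist(n, 0.0f);
    for (std::size_t d = 0; d < dims; ++d) {
        const float q = query[d];
        const float* column = block.data() + d * n;
        for (std::size_t v = 0; v < n; ++v) {
            const float diff = column[v] - q;
            dist[v] += diff * diff;
        }
    }
    return dist;
}
```

Because `l2_vertical` holds a partial distance for every vector after each dimension, a pruning algorithm could stop processing a vector early without touching its remaining dimensions. The sketch omits that step.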

Paper Structure

This paper contains 20 sections, 12 figures, 7 tables.

Figures (12)

  • Figure 1: PDX stores dimensions in a vertical layout, allowing efficient dimension-by-dimension distance calculation, more opportunities for SIMD execution, and better memory locality for search algorithms that prune dimensions.
  • Figure 2: Example of an IVF index on a collection of vectors. The IVF buckets naturally map to the concept of blocks of vectors in the PDX layout.
  • Figure 3: An Inner Product calculation on the horizontal layout (N-ary) and the PDX layout with 128-bit SIMD registers. The PDX kernel does not have dependencies (the distances of different vectors are aggregated in different SIMD lanes), is unaffected by dimensionality, and avoids the register reduce step. Constructing the PDX layout on-the-fly from the N-ary layout for calculations introduces a non-negligible overhead (N-ary + Gather), as discussed in the Discussion section.
  • Figure 4: The PDXearch framework within an IVF index: A search happens dimension-by-dimension per block (bucket). A linear scan is done in the first block ($C_0$ in the figure) to set a pruning threshold. In the following blocks ($C_1$), the search has two phases: WARMUP (keep scanning all vectors at incremental steps of D) and PRUNE (scan only the not-yet pruned vectors once they are few).
  • Figure 5: Example of three query-aware access order criteria: i) Decreasing: the dimension with the highest query value is accessed first ($D_1$ in the figure), ii) Distance to means: the dimension of the query with the largest distance to the collection means is accessed first ($D_5$), iii) Dimension zones: the subset of consecutive dimensions with the highest distance to the collection means is accessed first ($DZ_2$ in the figure).
  • ...and 7 more figures
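
The first two query-aware access orders described in the Figure 5 caption can be sketched in C++ as follows. This is a minimal illustration, not the paper's code: the function names are hypothetical, "highest query value" is read here as the raw (signed) value, and "distance to the collection means" is read as the absolute difference between the query and the per-dimension mean. Both readings are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Criterion i) "Decreasing": visit dimensions from the highest query value
// to the lowest. Returns the dimension indices in visiting order.
std::vector<std::size_t> order_decreasing(const std::vector<float>& query) {
    std::vector<std::size_t> order(query.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) {
                         return query[a] > query[b];
                     });
    return order;
}

// Criterion ii) "Distance to means": visit first the dimensions where the
// query lies farthest from the collection mean of that dimension.
// `means` is assumed to be precomputed, one value per dimension.
std::vector<std::size_t> order_distance_to_means(
        const std::vector<float>& query, const std::vector<float>& means) {
    assert(query.size() == means.size());
    std::vector<std::size_t> order(query.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) {
                         return std::fabs(query[a] - means[a]) >
                                std::fabs(query[b] - means[b]);
                     });
    return order;
}
```

In a vertical layout, the resulting index list could drive the outer loop of a dimension-by-dimension scan, so that the dimensions expected to contribute most to the distance are accumulated first.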