KScaNN: Scalable Approximate Nearest Neighbor Search on Kunpeng
Oleg Senkevich, Siyang Xu, Tianyi Jiang, Alexander Radionov, Jan Tabaszewski, Dmitriy Malyshev, Zijian Li, Daihao Xue, Licheng Yu, Weidi Zeng, Meiling Wang, Xin Yao, Siyu Huang, Gleb Neshchetkin, Qiuling Pan, Yaoyao Fu
TL;DR
KScaNN presents a hardware-aware, co-designed approach to scalable approximate nearest neighbor search on Kunpeng ARM systems. By integrating ML-driven adaptive search, a hybrid intra-cluster graph strategy, and highly optimized ARM SIMD kernels for PQ distance calculations, it closes the ARM-x86 performance gap and achieves up to $1.63\times$ speedup over leading x86 baselines. The work demonstrates that data-aware dimensionality reduction, per-query parameter tuning, and tight hardware specialization are crucial for high-throughput vector search on modern ARM CPUs. Its findings offer a practical blueprint for deploying leadership-class ANNS on ARM infrastructure and highlight the importance of co-design between algorithms and heterogeneous hardware.
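For readers unfamiliar with the "PQ distance calculations" mentioned above, the following is a minimal sketch of generic product-quantization distance computation via per-query lookup tables. It is not KScaNN's kernel: the function names (`build_lookup_tables`, `adc_distance`, `decode`) and the toy codebooks are assumptions made for illustration, and a production SIMD implementation would vectorize the table lookups rather than loop in Python.

```python
def build_lookup_tables(query, codebooks):
    """For each subspace, precompute the squared L2 distance from the
    query sub-vector to every centroid in that subspace's codebook."""
    tables = []
    offset = 0
    for codebook in codebooks:
        sub_dim = len(codebook[0])
        q_sub = query[offset:offset + sub_dim]
        tables.append([
            sum((q - c) ** 2 for q, c in zip(q_sub, centroid))
            for centroid in codebook
        ])
        offset += sub_dim
    return tables


def adc_distance(code, tables):
    """Approximate squared distance to an encoded vector:
    one table lookup per subspace, then a sum."""
    return sum(table[idx] for table, idx in zip(tables, code))


def decode(code, codebooks):
    """Reconstruct the quantized vector by concatenating chosen centroids."""
    out = []
    for codebook, idx in zip(codebooks, code):
        out.extend(codebook[idx])
    return out


# Toy example (hypothetical data): 2 subspaces, 2 dimensions each.
codebooks = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 2.0], [2.0, 0.0]]]
query = [1.0, 0.0, 2.0, 2.0]
tables = build_lookup_tables(query, codebooks)
code = [1, 0]  # centroid index chosen in each subspace
approx = adc_distance(code, tables)
exact = sum((q - x) ** 2 for q, x in zip(query, decode(code, codebooks)))
```

The lookup-based result equals the squared distance between the query and the decoded (quantized) vector, which is why the inner loop of a PQ scan reduces to table lookups and additions, the workload that SIMD kernels target.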
Abstract
Approximate Nearest Neighbor Search (ANNS) is a cornerstone algorithm for information retrieval, recommendation systems, and machine learning applications. While x86-based architectures have historically dominated this domain, the increasing adoption of ARM-based servers in industry creates a critical need for ANNS solutions optimized for ARM architectures. A naive port of existing x86 ANNS algorithms to ARM platforms results in a substantial performance deficit because it fails to leverage the capabilities of the underlying hardware. To address this challenge, we introduce KScaNN, a novel ANNS algorithm co-designed for the Kunpeng 920 ARM architecture. KScaNN combines data-aware algorithmic refinements with carefully designed, hardware-specific optimizations. Its core contributions include: 1) novel algorithmic techniques, including a hybrid intra-cluster search strategy and an improved product quantization (PQ) residual calculation method, which optimize the search process at a higher level; 2) an ML-driven adaptive search module that tunes search parameters per query, eliminating the inefficiencies of static configurations; and 3) highly optimized SIMD kernels for ARM that maximize hardware utilization for the critical distance computation workloads. The experimental results demonstrate that KScaNN not only closes the performance gap but establishes a new standard, achieving up to a $1.63\times$ speedup over the fastest x86-based solution. This work provides a definitive blueprint for achieving leadership-class performance for vector search on modern ARM architectures and underscores
