Approximate Hausdorff Distance for Multi-Vector Databases

Dongfang Zhao

TL;DR

We address the problem of comparing multi-vector data in large-scale vector databases via the Hausdorff distance, whose exact computation takes $O(mn)$ time for sets of sizes $m$ and $n$ and is impractical for billion-scale sets. We introduce a principled ANN-based approximation with a bidirectional, cached-mapping strategy that yields a symmetric Hausdorff estimate with provable error bounds and remains stable under translation, rotation, and scaling, with the impact of non-uniform scaling quantified. Theoretical guarantees bound the worst-case and data-dependent errors in terms of dataset geometry, intrinsic dimension, and query complexity, and the algorithm runs in $O(m\log n+n\log m)$ time per pair, enabling practical use in VectorDB pipelines. Overall, the framework provides a scalable way, backed by provable guarantees, to perform set-based similarity search on multi-vector representations, with potential extensions to other geometric measures and large-scale retrieval tasks.

Abstract

The Hausdorff distance is a fundamental measure for comparing sets of vectors, widely used in database theory and geometric algorithms. However, its exact computation is expensive, which often makes it impractical for large-scale applications such as multi-vector databases. In this paper, we introduce an approximation framework that efficiently estimates the Hausdorff distance while maintaining rigorous error bounds. Our approach leverages approximate nearest-neighbor (ANN) search to construct a surrogate function that preserves essential geometric properties while significantly reducing computational complexity. We provide a formal analysis of approximation accuracy, deriving both worst-case and expected error bounds. Additionally, we establish theoretical guarantees on the stability of our method under transformations, including translation, rotation, and scaling, and quantify the impact of non-uniform scaling on approximation quality. This work provides a principled foundation for integrating Hausdorff distance approximations into large-scale data retrieval and similarity search applications, ensuring both computational efficiency and theoretical correctness.
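
The general idea of replacing exact nearest-neighbor distances with approximate ones can be illustrated with a minimal Python sketch. It is not the paper's algorithm: it swaps in a small hand-written KD-tree with $(1+\epsilon)$-approximate pruning as the ANN index, omits the cached-mapping strategy, and assumes non-empty point sets given as tuples of equal dimension. All names (`build_kdtree`, `ann_distance`, `approx_hausdorff`, `exact_hausdorff`, `eps`) are hypothetical and chosen for illustration only.

```python
import math


def build_kdtree(points, depth=0):
    """Build a simple KD-tree as nested tuples: (point, axis, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return (
        ordered[mid],
        axis,
        build_kdtree(ordered[:mid], depth + 1),
        build_kdtree(ordered[mid + 1:], depth + 1),
    )


def ann_distance(node, q, eps=0.0, best=math.inf):
    """Distance from q to a (1 + eps)-approximate nearest neighbor in the tree."""
    if node is None:
        return best
    point, axis, left, right = node
    best = min(best, math.dist(q, point))
    diff = q[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = ann_distance(near, q, eps, best)
    # Every point on the far side is at least |diff| away from q, so the far
    # side is skipped unless it could beat best / (1 + eps).
    if abs(diff) * (1.0 + eps) < best:
        best = ann_distance(far, q, eps, best)
    return best


def approx_hausdorff(A, B, eps=0.1):
    """Symmetric estimate: the larger of the two ANN-based directed distances."""
    tree_a, tree_b = build_kdtree(A), build_kdtree(B)
    a_to_b = max(ann_distance(tree_b, a, eps) for a in A)
    b_to_a = max(ann_distance(tree_a, b, eps) for b in B)
    return max(a_to_b, b_to_a)


def exact_hausdorff(A, B):
    """Brute-force O(mn) reference, used here only for comparison."""
    a_to_b = max(min(math.dist(a, b) for b in B) for a in A)
    b_to_a = max(min(math.dist(a, b) for a in A) for b in B)
    return max(a_to_b, b_to_a)
```

In this sketch every approximate nearest-neighbor distance lies between the true distance and $(1+\epsilon)$ times it, so the estimate $\hat{d}_H$ satisfies $d_H(A,B) \le \hat{d}_H(A,B) \le (1+\epsilon)\, d_H(A,B)$, and `eps=0` recovers the exact value. This bound is a property of the sketch, not a restatement of the paper's error analysis.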
