ProHD: Projection-Based Hausdorff Distance Approximation

Jiuzhou Fu, Luanzheng Guo, Nathan R. Tallent, Dongfang Zhao

TL;DR

ProHD addresses the computational bottleneck of computing the exact Hausdorff distance on large, high-dimensional datasets by projecting the data onto a small set of informative directions and focusing on extreme points. The method forms small candidate subsets via centroid and PCA directions, then computes the distance on these subsets using fast ANN back-ends. The result is guaranteed to underestimate the true distance, with a deterministic additive error bound and monotonic convergence as more directions are added. Theoretical bounds (e.g., $H_{\mathcal{U}}(A,B)\le H(A,B)\le H_{\mathcal{U}}(A,B)+2\min_{u\in\mathcal{U}}\delta(u)$) underpin reliability, while empirical results on image, physics, and synthetic data demonstrate $10$–$100\times$ speedups over exact algorithms with $5$–$20\times$ lower error than random sampling. This approach enables scalable HD estimation in large vector databases and streaming contexts, offering a practical balance between efficiency and accuracy with broad applicability to geometric analysis and similarity search.
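The two-sided bound quoted above can be checked numerically on small data. The sketch below is illustrative only and rests on assumptions not confirmed by this summary: $H_{\mathcal{U}}(A,B)$ is taken to be the largest one-dimensional Hausdorff distance between the projections of $A$ and $B$ over the unit directions in $\mathcal{U}$, and $\delta(u)$ is taken to be the largest distance from any point of $A\cup B$ to the line spanned by $u$. The paper's own definitions may differ (for instance in how the data are centered). The function names `hausdorff` and `projected_bounds` are hypothetical.

```python
import numpy as np

def hausdorff(A, B):
    """Exact Hausdorff distance by brute force over all pairwise distances."""
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def projected_bounds(A, B, U):
    """Return (lower, upper) bounds on H(A, B) from unit directions U (one per row).

    ASSUMPTION: H_U is the max over u of the 1-D Hausdorff distance between
    projections, and delta(u) is the largest orthogonal residual to span{u}.
    """
    X = np.vstack([A, B])
    lower, slack = 0.0, np.inf
    for u in U:
        pa, pb = A @ u, B @ u
        gap = np.abs(pa[:, None] - pb[None, :])
        h_u = max(gap.min(axis=1).max(), gap.min(axis=0).max())
        lower = max(lower, h_u)  # adding a direction can only raise the lower bound
        resid = np.linalg.norm(X - np.outer(X @ u, u), axis=1).max()
        slack = min(slack, resid)  # min over u of the assumed delta(u)
    return lower, lower + 2.0 * slack

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 8))
B = rng.normal(size=(150, 8)) + 0.5
U = rng.normal(size=(4, 8))
U /= np.linalg.norm(U, axis=1, keepdims=True)  # normalize rows to unit length

lo, hi = projected_bounds(A, B, U)
exact = hausdorff(A, B)
assert lo - 1e-9 <= exact <= hi + 1e-9
```

Under these assumed definitions the lower bound holds because projection onto a unit vector never increases distances, and the upper bound follows from the triangle inequality through the two projected points.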

Abstract

The Hausdorff distance (HD) is a robust measure of set dissimilarity, but computing it exactly on large, high-dimensional datasets is prohibitively expensive. We propose ProHD, a projection-guided approximation algorithm that dramatically accelerates HD computation while maintaining high accuracy. ProHD identifies a small subset of candidate "extreme" points by projecting the data onto a few informative directions (such as the centroid axis and top principal components), and then computes the HD on this subset. This approach guarantees an underestimate of the true HD with a bounded additive error and typically achieves results within a few percent of the exact value. In extensive experiments on image, physics, and synthetic datasets (up to two million points in $D=256$), ProHD runs $10$–$100\times$ faster than exact algorithms while attaining $5$–$20\times$ lower error than random sampling-based approximations. Our method enables practical HD calculations in scenarios like large vector databases and streaming data, where quick and reliable set distance estimation is needed.
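The selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the parameters `m` (number of principal components) and `alpha` (selection fraction), and the choice to take PCA over the pooled data are all hypothetical. A brute-force nearest-neighbor search stands in for the ANN back-end. The selected candidates are queried against the full opposite set, a choice made here so that the estimate cannot exceed the exact HD by construction; the paper's exact procedure may differ.

```python
import numpy as np

def directions(A, B, m):
    """Centroid axis plus top-m principal components of the pooled data (unit rows)."""
    X = np.vstack([A, B])
    dirs = []
    c = A.mean(axis=0) - B.mean(axis=0)
    n = np.linalg.norm(c)
    if n > 0:
        dirs.append(c / n)
    # Rows of Vt are principal directions of the centered pooled data.
    _, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    dirs.extend(Vt[:m])
    return np.array(dirs)

def extreme_indices(X, U, alpha):
    """Indices of points in the lowest or highest alpha-fraction of any projection."""
    k = max(1, int(np.ceil(alpha * len(X))))
    keep = set()
    for u in U:
        order = np.argsort(X @ u)
        keep.update(order[:k].tolist())
        keep.update(order[-k:].tolist())
    return np.array(sorted(keep), dtype=int)

def directed(Q, R):
    """Largest nearest-neighbor distance from queries Q into R (brute force, not ANN)."""
    D = np.linalg.norm(Q[:, None, :] - R[None, :, :], axis=2)
    return D.min(axis=1).max()

def prohd_sketch(A, B, m=2, alpha=0.05):
    """Underestimate of H(A, B) using only extreme candidates as query points."""
    U = directions(A, B, m)
    A_sel = A[extreme_indices(A, U, alpha)]
    B_sel = B[extreme_indices(B, U, alpha)]
    return max(directed(A_sel, B), directed(B_sel, A))

rng = np.random.default_rng(1)
A = rng.normal(size=(300, 16))
B = rng.normal(size=(300, 16)) + 0.3
exact = max(directed(A, B), directed(B, A))
approx = prohd_sketch(A, B, m=2, alpha=0.05)
assert approx <= exact + 1e-9
```

Because the outer maximum runs over a subset of the query points while the inner minimum still sees every point of the other set, the estimate never exceeds the exact value, and raising `alpha` can only bring it closer.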

Paper Structure

This paper contains 36 sections, 27 equations, 5 figures, 2 tables, 3 algorithms.

Figures (5)

  • Figure 1: Average Relative Error (%) versus Runtime (s, log scale) for EBHD, ZHD, Points-Ruling-Out (Ruleout), and three approximate methods (ProHD, Random Sampling, Systematic Sampling) on four datasets. We exclude ANN-Exact from these plots because it always achieves zero error at runtimes strictly lower than EBHD, ZHD, or Ruleout.
  • Figure 2: Parameter Sensitivity: Average Relative Error (%) (solid lines, left y-axis) and Runtime (s) (dashed lines, right y-axis, log scale) versus selection fraction $\alpha$. Top row: CIFAR-10 ($D=64$, $n_A=n_B=6000$). Bottom row: Higgs ($D=28$, $n_A=n_B=100\,000$). ProHD’s error decreases sharply as $\alpha$ grows and remains much lower than Random and Systematic Sampling. Runtime of ProHD increases roughly linearly in $\alpha$, whereas sampling runtimes remain smaller until $\alpha$ becomes large.
  • Figure 3: Dimension Scalability: Relative Error (%) (solid lines, left y-axis) and Runtime (s) (dashed lines, right y-axis, log scale) versus embedding dimension $D$ for CIFAR-10 ($n_A=n_B=6000$), MNIST ($6000,6000$), and Random Clouds ($100\,000,100\,000$). ProHD’s error declines sharply with $D$, while runtime increases sublinearly. Random and Systematic Sampling maintain high error across all $D$.
  • Figure 4: Set-Size Ratio Scalability: Relative Error (%) (solid lines, left y-axis) and Runtime (s) (dashed lines, right y-axis, log scale) versus ratio $n_B/n_A$. Top: Higgs ($D=28$, $n_A=100\,000$). Bottom: Random Clouds ($D=4$, $n_A=100\,000$). ProHD maintains near-zero error on Higgs and substantially lower error on Random Clouds compared to sampling, even for imbalanced sets. ProHD’s runtime increases modestly with larger $n_B$.
  • Figure 5: Total Set-Size Scalability: Relative Error (%) (solid lines, left y-axis) and Runtime (s) (dashed lines, right y-axis, log scale) versus total size $n_A+n_B$. Left: Higgs ($D=28$). Right: Random Clouds ($D=4$). ProHD’s error stays low (below $0.5\%$ on Higgs, $6\%$ on Random Clouds) even at 2 million points, while runtime grows approximately linearly. Sampling methods show much higher error at both scales.

Theorems & Definitions (2)

  • proof : Sketch of Proof
  • proof : Sketch of Proof