Magnitude Distance: A Geometric Measure of Dataset Similarity
Sahel Torkamani, Henry Gouk, Rik Sarkar
TL;DR
This work introduces magnitude distance, a geometry-based metric between finite datasets built from the magnitude of metric spaces, featuring a tunable scale parameter $t$ that balances global versus local data structure. By leveraging a kernelized similarity matrix and its inverse, the approach maintains discriminability in high dimensions and offers principled robustness to outliers. The authors establish core properties, including symmetry, non-negativity, and a scale-dependent limiting behavior, and show how magnitude distance can serve as a training objective in push-forward generative models, exemplified by the Magnitude Generative Network (MagGN). Through theoretical analysis and empirical studies on MNIST, CIFAR-10, and CelebA, the method demonstrates meaningful dataset discrimination, training efficiency, and improved downstream performance, suggesting broad applicability in hypothesis testing, distribution shift robustness, and privacy-aware data analysis.
Abstract
Quantifying the distance between datasets is a fundamental question in mathematics and machine learning. We propose \textit{magnitude distance}, a novel distance defined on finite datasets using the notion of the \emph{magnitude} of a metric space. The proposed distance incorporates a tunable scaling parameter, $t$, that controls the trade-off between sensitivity to global structure (small $t$) and sensitivity to finer details (large $t$). We prove several theoretical properties of magnitude distance, including its limiting behavior across scales and the conditions under which it satisfies key metric properties. In contrast to classical distances, we show that magnitude distance remains discriminative in high-dimensional settings when the scale is appropriately tuned. We further demonstrate how magnitude distance can be used as a training objective for push-forward generative models. Our experimental results support our theoretical analysis and demonstrate that magnitude distance provides meaningful signals, comparable to those of established distance-based generative approaches.
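To make the role of the scale parameter $t$ concrete, the following is a minimal sketch of the standard magnitude of a finite metric space, on which the proposed distance is built. It forms the similarity matrix $Z_{ij} = e^{-t\,d(x_i, x_j)}$, solves $Zw = \mathbf{1}$ for the weighting $w$, and returns $\sum_i w_i$. The function names `solve` and `magnitude`, the choice of Euclidean distance, and the toy point set are assumptions made for illustration. The abstract does not state the formula for magnitude distance itself, so the sketch does not attempt to reproduce it.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build an augmented copy so the inputs are not mutated.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def magnitude(points, t):
    """Magnitude of a finite point set at scale t (Euclidean distance assumed)."""
    n = len(points)
    # Similarity matrix Z_ij = exp(-t * d(x_i, x_j)).
    z = [[math.exp(-t * math.dist(p, q)) for q in points] for p in points]
    # The weighting w solves Z w = 1; the magnitude is the sum of its entries.
    w = solve(z, [1.0] * n)
    return sum(w)


if __name__ == "__main__":
    # Hypothetical toy dataset: three nearby points and one distant point.
    toy = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
    for t in (0.001, 0.1, 1.0, 50.0):
        print(t, magnitude(toy, t))
```

For small $t$ all points look alike and the magnitude approaches 1, while for large $t$ every distinct point is resolved and the magnitude approaches the number of points. This is the sense in which $t$ moves between global structure and finer details.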
