Scalable Signature-Based Distribution Regression via Reference Sets
Andrew Alden, Carmine Ventre, Blanka Horvath
TL;DR
This work tackles the computational bottlenecks of higher-order distribution regression for path-valued data by introducing SPEEDRS, a landmark-distance approach built on a novel $2^{\text{nd}}$-order MMD approximator. By computing distances to fixed reference sets with a model-agnostic, pathwise MMD estimator based on expected signatures, the method scales to long paths, high dimensions, and large batch sizes while preserving distributional information. The approach is validated in three domains (derivative pricing, mixture parameter estimation, and physical science applications), where it demonstrates strong performance, robustness to distributional shifts, and resilience to irregular sampling. Overall, SPEEDRS makes distribution regression on stochastic processes practical and generalizable, with significant reductions in memory and computation, broadening the applicability of signature-based methods to real-world tasks.
Abstract
Distribution Regression (DR) on stochastic processes is the learning task of regression on collections of time series. Path signatures, a tool from stochastic analysis, have been used to solve the DR problem, and recent works have demonstrated that such solutions can leverage the information encoded in paths via signature-based features. However, current state-of-the-art DR solutions are memory-intensive and incur a high computational cost. This leads to a trade-off between path length and the number of paths considered. The resulting bottleneck limits applications to small sample sizes, which in turn introduces estimation uncertainty. In this paper, we present a methodology that addresses these issues, resolving estimation uncertainties while also providing a pipeline that enables DR to be used for a wide variety of learning tasks. Integral to our approach is a novel distance approximator, which allows us to apply our methodology seamlessly across different application domains, sampling rates, and stochastic process dimensions. We show that our model performs well in applications related to estimation theory, quantitative finance, and the physical sciences. We demonstrate that our model generalises well, not only to unseen data within a given distribution, but also under unseen regimes (unseen classes of stochastic models).
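To make the reference-set idea concrete, the following is a minimal sketch and not the method proposed in the paper. It assumes a much simpler stand-in for the distance approximator: the Euclidean distance between empirical expected signatures truncated at depth 2, which is only a first-order, truncated proxy for a signature-based MMD. The function names (`signature_level2`, `expected_signature`, `landmark_features`) and the toy random-walk data are hypothetical and chosen for illustration only.

```python
import numpy as np

def signature_level2(path):
    """Depth-2 truncated signature of a piecewise-linear path of shape (T, d)."""
    increments = np.diff(path, axis=0)        # segment increments, shape (T-1, d)
    start_offsets = path[:-1] - path[0]       # X_k - X_0 at the start of each segment
    level1 = path[-1] - path[0]               # total increment
    # Chen's identity for linear segments: sum_k (X_k - X_0) (x) dX_k + 0.5 dX_k (x) dX_k
    level2 = start_offsets.T @ increments + 0.5 * increments.T @ increments
    return np.concatenate([level1, level2.ravel()])

def expected_signature(paths):
    """Empirical expected (truncated) signature of a batch of paths, shape (n, T, d)."""
    return np.mean([signature_level2(p) for p in paths], axis=0)

def landmark_features(batch, reference_sets):
    """Distances from one batch of paths to each fixed reference set (assumed proxy)."""
    es_batch = expected_signature(batch)
    return np.array([
        np.linalg.norm(es_batch - expected_signature(ref)) for ref in reference_sets
    ])

def random_walks(rng, n, T, d, vol):
    """Toy data: discretised random walks with volatility `vol`, started at zero."""
    steps = vol * rng.standard_normal((n, T - 1, d)) / np.sqrt(T - 1)
    walks = np.cumsum(steps, axis=1)
    return np.concatenate([np.zeros((n, 1, d)), walks], axis=1)

rng = np.random.default_rng(0)
# Three fixed reference sets, each a collection of paths from a different regime.
reference_sets = [random_walks(rng, 64, 50, 2, vol) for vol in (0.5, 1.0, 2.0)]
# One input "distribution", represented by a batch of sample paths.
batch = random_walks(rng, 64, 50, 2, 1.5)
features = landmark_features(batch, reference_sets)  # one distance per reference set
```

Each input batch is mapped to a fixed-length vector with one entry per reference set, regardless of the path length or the number of paths in the batch. Such vectors could then be passed to any standard regressor; the choice of regressor, truncation depth, and reference sets here is an assumption of this sketch.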
