D-optimal Subsampling Design for Massive Data Linear Regression
Torsten Glemser, Rainer Schwabe
TL;DR
The paper tackles data reduction in massive data linear regression by deriving D-optimal continuous subsampling designs bounded by the covariate distribution. The framework rests on the information matrix $\mathbf{M}(\xi)$ and the $D$-criterion. An equivalence theorem characterizes the optimal designs, which sample outside concentration ellipsoids for elliptical covariates and outside spheres for spherical covariates. Two practical implementations are given, a full Mahalanobis-distance-based method (D-OPT) and a simplified, faster variant (D-OPT-s), together with algorithms for generating subsamples. Simulations show that D-OPT often outperforms IBOSS, especially when covariates are correlated or heavy-tailed. The results give implementable guidance for constructing informative subsamples in fixed-subsample-size and fixed-proportion settings, which is most relevant for large-scale regression where response costs dominate.
Abstract
Data reduction is a fundamental challenge of modern technology, where classical statistical methods are not applicable because of computational limitations. We consider multiple linear regression for an extraordinarily large number of observations, but only a few covariates. Subsampling aims to select a given proportion of the original data. Under distributional assumptions on the covariates, we derive D-optimal subsampling designs and study their theoretical properties. We make use of fundamental concepts of optimal design theory and an equivalence theorem from constrained convex optimization. The resulting subsampling designs provide simple rules for whether to accept or reject a data point, allowing for an easy algorithmic implementation. In addition, we propose a simplified subsampling method that has lower computational complexity but deviates from the D-optimal design. We present a simulation study comparing both subsampling schemes with the IBOSS method in the case of a fixed subsample size.
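To make the accept-or-reject idea concrete, the following is a minimal sketch of a Mahalanobis-distance rule for a fixed subsample size. It is an illustration under stated assumptions, not the authors' algorithm: the function names (`squared_mahalanobis`, `select_outer_points`, `log_det_information`) are hypothetical, the mean and covariance are estimated from the full data, and the simulated covariates, `n = 10_000`, and `k = 500` are arbitrary choices.

```python
import numpy as np


def squared_mahalanobis(X):
    """Squared Mahalanobis distance of each row of X (shape (n, d)) from the sample mean."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Sample covariance; atleast_2d handles the single-covariate case.
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    # Solve cov @ Z = centered.T instead of forming the inverse explicitly.
    solved = np.linalg.solve(cov, centered.T).T
    return np.einsum("ij,ij->i", centered, solved)


def select_outer_points(X, k):
    """Sorted indices of the k rows with the largest squared distance."""
    d2 = squared_mahalanobis(X)
    if not 0 < k <= d2.size:
        raise ValueError("k must satisfy 0 < k <= number of rows")
    # argpartition places the k largest values in the last k positions.
    return np.sort(np.argpartition(d2, -k)[-k:])


def log_det_information(X):
    """log det of F^T F for a linear model with intercept and linear terms."""
    F = np.column_stack([np.ones(len(X)), X])
    _, logdet = np.linalg.slogdet(F.T @ F)
    return logdet


# Illustrative data (assumption): correlated bivariate normal covariates.
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(np.zeros(2), cov, size=10_000)
k = 500

outer = select_outer_points(X, k)                      # points outside the ellipsoid
uniform = rng.choice(len(X), size=k, replace=False)    # uniform random baseline

print("log det, outer points:  ", log_det_information(X[outer]))
print("log det, uniform sample:", log_det_information(X[uniform]))
```

Keeping the `k` points with the largest distance amounts to accepting every point outside an empirical concentration ellipsoid whose radius is the `k`-th largest distance. The log-determinant comparison against a uniform random subsample is only a quick sanity check of the D-criterion, not a substitute for the simulation study.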
