D-optimal Subsampling Design for Massive Data Linear Regression

Torsten Glemser; Rainer Schwabe

TL;DR

The paper tackles data reduction in massive data linear regression by deriving D-optimal continuous subsampling designs bounded by the covariate distribution. It develops a principled framework based on the information matrix $\mathbf{M}(\xi)$ and the $D$-criterion, with an equivalence-theorem characterization that yields sampling regions outside concentration ellipsoids or spheres for elliptical and spherical covariates, respectively. It provides two practical implementations: a full Mahalanobis-distance-based D-OPT and a simplified, faster variant D-OPT-s, plus algorithms to generate subsamples, and demonstrates via simulations that D-OPT often outperforms IBOSS, especially when covariates are correlated or heavy-tailed. The results offer implementable guidance for constructing informative subsamples in fixed-subsample and fixed-proportion settings, with clear implications for large-scale regression where response costs dominate.

Abstract

Data reduction is a fundamental challenge of modern technology, where classical statistical methods are not applicable because of computational limitations. We consider multiple linear regression for an extraordinarily large number of observations, but only a few covariates. Subsampling aims at the selection of a given proportion of the existing original data. Under distributional assumptions on the covariates, we derive D-optimal subsampling designs and study their theoretical properties. We make use of fundamental concepts of optimal design theory and an equivalence theorem from constrained convex optimization. The resulting subsampling designs provide simple rules for whether to accept or reject a data point, allowing for an easy algorithmic implementation. In addition, we propose a simplified subsampling method with lower computational complexity that deviates from the D-optimal design. We present a simulation study comparing both subsampling schemes with the IBOSS method in the case of a fixed subsample size.
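To make the accept-or-reject idea concrete for a fixed subsample size, the following is a minimal sketch, not the authors' algorithm. It assumes two covariates, estimates the centre and scatter from the full data, and keeps the $k$ points with the largest squared Mahalanobis distance. The function names (`mean_and_cov`, `sq_mahalanobis`, `select_most_distant`, `information_det`), the correlation `rho = 0.5`, and the sizes are illustrative choices. Only the standard library is used so the block stays self-contained; a numerical library would be the natural choice at scale.

```python
import heapq
import math
import random


def mean_and_cov(points):
    """Sample mean and 2x2 sample covariance (s11, s12, s22) of (x1, x2) pairs."""
    n = len(points)
    m1 = sum(x1 for x1, _ in points) / n
    m2 = sum(x2 for _, x2 in points) / n
    s11 = sum((x1 - m1) ** 2 for x1, _ in points) / (n - 1)
    s22 = sum((x2 - m2) ** 2 for _, x2 in points) / (n - 1)
    s12 = sum((x1 - m1) * (x2 - m2) for x1, x2 in points) / (n - 1)
    return (m1, m2), (s11, s12, s22)


def sq_mahalanobis(point, mean, cov):
    """Squared Mahalanobis distance, using the closed-form 2x2 inverse."""
    s11, s12, s22 = cov
    det = s11 * s22 - s12 ** 2
    u, v = point[0] - mean[0], point[1] - mean[1]
    return (s22 * u * u - 2.0 * s12 * u * v + s11 * v * v) / det


def select_most_distant(points, k):
    """Keep the k points farthest from the centre in Mahalanobis distance."""
    mean, cov = mean_and_cov(points)
    return heapq.nlargest(k, points, key=lambda p: sq_mahalanobis(p, mean, cov))


def information_det(points):
    """Determinant of sum f(x) f(x)^T for f(x) = (1, x1, x2)^T.

    By the Schur complement this equals k * det(centred scatter matrix),
    and the centred scatter matrix is (k - 1) times the sample covariance.
    """
    k = len(points)
    _, (s11, s12, s22) = mean_and_cov(points)
    return k * (k - 1) ** 2 * (s11 * s22 - s12 ** 2)


rng = random.Random(0)
rho = 0.5  # illustrative correlation between the two covariates
full_data = []
for _ in range(20_000):
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    full_data.append((z1, rho * z1 + math.sqrt(1.0 - rho ** 2) * z2))

k = 200
d_opt_like = select_most_distant(full_data, k)
uniform = rng.sample(full_data, k)
gain = information_det(d_opt_like) / information_det(uniform)
print(f"information determinant ratio (distance-based / uniform): {gain:.1f}")
```

The ratio compares the $D$-criterion of the distance-based subsample with that of a uniform random subsample of the same size; values above 1 indicate a more informative subsample in the $D$-sense.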

Paper Structure

This paper contains 14 sections, 14 theorems, 39 equations, 8 figures, 1 table, 3 algorithms.

Introduction
Model Specification
Continuous Subsampling Design
General Case
A Single Covariate
Multiple Covariates With Elliptical Distribution
Subsampling Algorithms
Subsampling Design with Fixed Sample Size, Simulation
Simulation Setup
Simulation Results for Algorithm \ref{alg:topk}
Computational Complexity
Simplified Algorithm
Discussion
Technical Details

Key Result

Theorem 3.1

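The subsampling design $\xi_{\alpha}^{*}$ is $D$-optimal if and only if $\xi_{\alpha}^{*}$ has density $f_{\xi_{\alpha}^{*}}(\boldsymbol{x}) = f_{\boldsymbol{X}}(\boldsymbol{x}) \, \mathbb{1}\left\{ (\boldsymbol{x} - \boldsymbol{m}(\xi_{\alpha}^{*}))^{\top} \mathbf{S}(\xi_{\alpha}^{*})^{-1} (\boldsymbol{x} - \boldsymbol{m}(\xi_{\alpha}^{*})) \geq c \right\}$, where $f_{\boldsymbol{X}}$ denotes the density of the covariates $\boldsymbol{X}_{i}$ and $c$ satisfies $\operatorname{P}\left((\boldsymbol{X}_{i} - \boldsymbol{m}(\xi_{\alpha}^{*}))^{\top} \mathbf{S}(\xi_{\alpha}^{*})^{-1} (\boldsymbol{X}_{i} - \boldsymbol{m}(\xi_{\alpha}^{*})) \geq c \right) = \alpha$.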
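As a worked illustration of the threshold condition, consider the simplest case of a single standard normal covariate. This is a sketch under stated assumptions: by symmetry the centre $m(\xi_{\alpha}^{*})$ is taken to be $0$, so the condition reduces to $|x| \geq q$ with $\operatorname{P}(|X_{i}| \geq q) = \alpha$, that is, $q$ is the $(1 - \alpha/2)$-quantile of the standard normal distribution. The helper names `tail_threshold` and `accept` and the choice $\alpha = 0.1$ are hypothetical and serve only to demonstrate the rule.

```python
import random
from statistics import NormalDist


def tail_threshold(alpha):
    """Cut-off q with P(|X| >= q) = alpha for a standard normal covariate."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


def accept(x, threshold):
    """Accept a data point if it lies in one of the two tails."""
    return abs(x) >= threshold


alpha = 0.1  # illustrative subsampling proportion
threshold = tail_threshold(alpha)

rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
accepted = [x for x in sample if accept(x, threshold)]
fraction = len(accepted) / len(sample)
print(f"threshold = {threshold:.4f}, accepted fraction = {fraction:.4f}")
```

Because each point is checked against a fixed cut-off, the rule can be applied in a single pass over the data. The accepted fraction should be close to $\alpha$, up to sampling variation.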

Figures (8)

Figure 1: Second moment $m_{2}(\xi_{\alpha}^{*})$ of the $D$-optimal subsampling design $\xi_{\alpha}^{*}$ for standard (multivariate) normal distributions of dimensions $d = 1$ (solid), $2$ (dashes), $5$ (long dashes), $10$ (dashes and dots), $50$ (long and short dashes), and $1\,000$ (dots) in dependence on the subsampling proportion $\alpha$
Figure 2: Density of the marginal optimal subsampling design $\xi_{R}^{*}$ (solid) and the marginal distribution of the covariates $R(\boldsymbol{X}_{i})$ (dashed) on the radius, standard bivariate normal distribution, subsampling proportion $\alpha = 0.1$
Figure 3: Efficiency of uniform random subsampling for multivariate normal distributions of dimensions $d = 1$ (solid), $2$ (dashes), $5$ (long dashes), $10$ (dashes and dots), $50$ (long and short dashes), and $1\,000$ (dots) in dependence on the subsampling proportion $\alpha$
Figure 4: Approximate (lines) and simulated (symbols) standardized mean squared errors and approximate efficiency of uniform random subsampling in dependence on full data size $n$, subsample size $k = 1\,000$, and various numbers $d$ of standard normal covariates
Figure 5: Simulated standardized determinant of the slope covariance matrix for normally distributed covariates, uncorrelated case (left) and correlation $\rho = 0.5$ (right)
...and 3 more figures

Theorems & Definitions (31)

Theorem 3.1
Corollary 3.2
Corollary 3.3
Corollary 3.4
Theorem 3.5
Example 3.1: standard multivariate normal distribution
Example 3.2: multivariate $t$-distribution
Theorem 3.6
Example 3.3: general multivariate normal distribution
Lemma 3.7
...and 21 more
