D-optimal Subsampling Design for Massive Data Linear Regression

Torsten Glemser, Rainer Schwabe

TL;DR

The paper tackles data reduction in massive data linear regression by deriving D-optimal continuous subsampling designs bounded by the covariate distribution. It develops a principled framework based on the information matrix $\mathbf{M}(\xi)$ and the $D$-criterion, with an equivalence-theorem characterization that yields sampling regions outside concentration ellipsoids or spheres for elliptical and spherical covariates, respectively. It provides two practical implementations: a full Mahalanobis-distance-based D-OPT and a simplified, faster variant D-OPT-s, plus algorithms to generate subsamples, and demonstrates via simulations that D-OPT often outperforms IBOSS, especially when covariates are correlated or heavy-tailed. The results offer implementable guidance for constructing informative subsamples in fixed-subsample and fixed-proportion settings, with clear implications for large-scale regression where response costs dominate.

Abstract

Data reduction is a fundamental challenge of modern technology, where classical statistical methods are not applicable because of computational limitations. We consider multiple linear regression for an extraordinarily large number of observations, but only a few covariates. Subsampling aims at the selection of a given proportion of the existing original data. Under distributional assumptions on the covariates, we derive D-optimal subsampling designs and study their theoretical properties. We make use of fundamental concepts of optimal design theory and an equivalence theorem from constrained convex optimization. The thus obtained subsampling designs provide simple rules for whether to accept or reject a data point, allowing for an easy algorithmic implementation. In addition, we propose a simplified subsampling method with lower computational complexity that deviates from the D-optimal design. We present a simulation study, comparing both subsampling schemes with the IBOSS method in the case of a fixed size of the subsample.

Paper Structure (14 sections, 14 theorems, 39 equations, 8 figures, 1 table, 3 algorithms)

Key Result

Theorem 3.1

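The subsampling design $\xi_{\alpha}^{*}$ is $D$-optimal if and only if $\xi_{\alpha}^{*}$ has density $f_{\xi_{\alpha}^{*}}(\boldsymbol{x}) = f_{\boldsymbol{X}}(\boldsymbol{x})\, \mathbb{1}\left\{(\boldsymbol{x} - \boldsymbol{m}(\xi_{\alpha}^{*}))^{\top} \mathbf{S}(\xi_{\alpha}^{*})^{-1} (\boldsymbol{x} - \boldsymbol{m}(\xi_{\alpha}^{*})) \geq c\right\}$, where $f_{\boldsymbol{X}}$ denotes the density of the covariates $\boldsymbol{X}_{i}$ and $c$ satisfies $\operatorname{P}\left((\boldsymbol{X}_{i} - \boldsymbol{m}(\xi_{\alpha}^{*}))^{\top} \mathbf{S}(\xi_{\alpha}^{*})^{-1} (\boldsymbol{X}_{i} - \boldsymbol{m}(\xi_{\alpha}^{*})) \geq c \right) = \alpha$.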
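
The accept/reject rule in Theorem 3.1 can be illustrated with a minimal Python sketch. It is not the paper's D-OPT algorithm. It assumes that the sample mean and sample covariance of the full data may stand in for $\boldsymbol{m}(\xi_{\alpha}^{*})$ and $\mathbf{S}(\xi_{\alpha}^{*})$, a simplification that seems reasonable for elliptically distributed covariates, and it takes $c$ as the empirical $(1 - \alpha)$-quantile of the squared Mahalanobis distances. The function names `mahalanobis_subsample` and `log_det_information`, the seed, and the simulated data are hypothetical choices made for illustration.

```python
import numpy as np


def mahalanobis_subsample(X, alpha):
    """Flag roughly the alpha-fraction of rows farthest from the center.

    Assumption: the full-data sample mean and covariance replace the
    design-dependent quantities m(xi*) and S(xi*) of Theorem 3.1.
    """
    X = np.asarray(X, dtype=float)
    center = X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    diffs = X - center
    # Squared Mahalanobis distance of every row to the center.
    sq_dist = np.einsum("ij,ij->i", diffs, np.linalg.solve(cov, diffs.T).T)
    # Empirical threshold c with P(sq_dist >= c) approximately alpha.
    c = np.quantile(sq_dist, 1.0 - alpha)
    return sq_dist >= c, c


def log_det_information(Z):
    """Log-determinant of F^T F for a linear model with intercept."""
    F = np.column_stack([np.ones(len(Z)), Z])
    return np.linalg.slogdet(F.T @ F)[1]


# Illustrative data: correlated bivariate normal covariates.
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=100_000)

keep, c = mahalanobis_subsample(X, alpha=0.1)

# Compare against a uniform random subsample of the same size.
random_rows = rng.choice(len(X), size=int(keep.sum()), replace=False)
logdet_selected = log_det_information(X[keep])
logdet_random = log_det_information(X[random_rows])
print(keep.mean(), logdet_selected, logdet_random)
```

Under these assumptions, the selected points lie outside an ellipsoid around the data center, so their information matrix has a larger determinant than that of a uniform random subsample of equal size.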

Figures (8)

  • Figure 1: Second moment $m_{2}(\xi_{\alpha}^{*})$ of the $D$-optimal subsampling design $\xi_{\alpha}^{*}$ for standard (multivariate) normal distributions of dimensions $d = 1$ (solid), $2$ (dashes), $5$ (long dashes), $10$ (dashes and dots), $50$ (long and short dashes), and $1\,000$ (dots) as a function of the subsampling proportion $\alpha$
  • Figure 2: Density of the marginal optimal subsampling design $\xi_{R}^{*}$ (solid) and the marginal distribution of the covariates $R(\boldsymbol{X}_{i})$ (dashed) on the radius, standard bivariate normal distribution, subsampling proportion $\alpha = 0.1$
  • Figure 3: Efficiency of uniform random subsampling for multivariate normal distributions of dimensions $d = 1$ (solid), $2$ (dashes), $5$ (long dashes), $10$ (dashes and dots), $50$ (long and short dashes), and $1\,000$ (dots) as a function of the subsampling proportion $\alpha$
  • Figure 4: Approximate (lines) and simulated (symbols) standardized mean squared errors and approximate efficiency of uniform random subsampling as a function of the full data size $n$, for subsample size $k = 1\,000$ and various numbers $d$ of standard normal covariates
  • Figure 5: Simulated standardized determinant of the slope covariance matrix for normally distributed covariates, uncorrelated case (left) and correlation $\rho = 0.5$ (right)
  • ...and 3 more figures

Theorems & Definitions (31)

  • Theorem 3.1
  • Corollary 3.2
  • Corollary 3.3
  • Corollary 3.4
  • Theorem 3.5
  • Example 3.1: standard multivariate normal distribution
  • Example 3.2: multivariate $t$-distribution
  • Theorem 3.6
  • Example 3.3: general multivariate normal distribution
  • Lemma 3.7
  • ...and 21 more