A subsampling approach for large data sets when the Generalised Linear Model is potentially misspecified

Amalan Mahendran, Helen Thompson, James M. McGree

TL;DR

This work tackles scalable inference on large datasets when GLMs may be misspecified. It replaces model-reliant subsampling probabilities with an AMSE-based criterion that accounts for misspecification, using a two-stage subsampling design (Stage 1 to estimate parameters and misspecification; Stage 2 to compute refined probabilities and subsample) to yield the final weighted GLM fit. The authors compare their RLmAMSE approach against traditional $A$-, $L$-, and $L_1$-optimality methods across linear, logistic, and Poisson regression, in both simulated and real-world settings, including skin segmentation and song-play counts; RLmAMSE often achieves the best or near-best predictive performance, especially when misspecification is present, and remains competitive when misspecification is absent. They further show that GAM-based misspecification estimation and Stage-1-derived probabilities approximate full-data benchmarks well, offering substantial computational savings without sacrificing accuracy. The results advocate adopting misspecification-aware subsampling as a practical tool for efficient, reliable inference on very large GLM-structured datasets, with potential extensions to overdispersion and broader misspecification forms.

Abstract

Subsampling is a computationally efficient and scalable method to draw inference in large data settings based on a subset of the data rather than needing to consider the whole dataset. When employing subsampling techniques, a crucial consideration is how to select an informative subset based on the queries posed by the data analyst. A recently proposed method for this purpose involves randomly selecting samples from the large dataset based on subsampling probabilities. However, a major drawback of this approach is that the derived subsampling probabilities are typically based on an assumed statistical model which may be difficult to correctly specify in practice. To address this limitation, we propose to determine subsampling probabilities based on a statistical model that we acknowledge may be misspecified. To do so, we propose to evaluate the subsampling probabilities based on the Mean Squared Error (MSE) of the predictions from a model that is not assumed to completely describe the large dataset. We apply our subsampling approach in a simulation study and for the analysis of two real-world large datasets, where its performance is benchmarked against existing subsampling techniques. The findings suggest that there is value in adopting our approach over current practice.
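
The two-stage design summarised above can be illustrated with a minimal sketch for linear regression (a GLM with identity link). This is not the authors' RLmAMSE criterion: the AMSE-based probabilities are not given in this summary, so the sketch assumes a simple stand-in score (absolute pilot residual times covariate norm, in the spirit of $L$-optimality). The function names `weighted_least_squares` and `two_stage_subsample_fit`, the sample sizes, and the simulated data are all hypothetical.

```python
import numpy as np


def weighted_least_squares(X, y, w):
    """Solve the weighted normal equations (X' W X) beta = X' W y."""
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)


def two_stage_subsample_fit(X, y, r0, r, rng):
    """Two-stage subsampling sketch for linear regression.

    Stage 1: uniform pilot subsample of size r0 gives a rough estimate.
    Stage 2: pilot-based probabilities over all rows, then a subsample of
             size r drawn with replacement and an inverse-probability
             weighted fit on that subsample.
    """
    n = X.shape[0]

    # Stage 1: uniform pilot subsample and unweighted pilot fit.
    idx0 = rng.choice(n, size=r0, replace=False)
    beta_pilot = weighted_least_squares(X[idx0], y[idx0], np.ones(r0))

    # Stage 2: ASSUMED stand-in score, not the paper's AMSE-based criterion.
    score = np.abs(y - X @ beta_pilot) * np.linalg.norm(X, axis=1)
    probs = score / score.sum()
    idx1 = rng.choice(n, size=r, replace=True, p=probs)

    # Inverse-probability weights, rescaled for numerical convenience
    # (rescaling all weights by a constant does not change the solution).
    weights = 1.0 / probs[idx1]
    weights = weights / weights.mean()
    beta_hat = weighted_least_squares(X[idx1], y[idx1], weights)
    return beta_hat, probs


# Usage on simulated data (hypothetical sizes and coefficients).
rng = np.random.default_rng(0)
n = 20_000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + rng.normal(size=n)

beta_hat, probs = two_stage_subsample_fit(X, y, r0=500, r=2000, rng=rng)
print("estimate from subsample:", np.round(beta_hat, 3))
```

The weighting by the inverse sampling probability is what keeps the estimating equations unbiased for the full-data fit. A misspecification-aware method would replace only the `score` line with a criterion that also accounts for the estimated misspecification.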

Paper Structure

This paper contains 24 sections, 13 equations, 13 figures, 4 tables, 2 algorithms.

Figures (13)

  • Figure 1: Rows (top to bottom): Linear predictor against covariates for misspecification Types 1, 2a, 3a, 2b and 3b under linear regression, respectively. Columns (left to right): covariates $x_1$ and $x_2$ for models $1$ and $10$. Colours: the data generating model data points in green and analysis model data points in red.
  • Figure 2: Logarithm-scaled AMSME for our proposed misspecification estimation and the approach of Adewale and Wiens (2009) under (for rows top to bottom) misspecification Types 1, 2a, 3a, 2b and 3b, under (columns left to right) linear, logistic and Poisson regression models.
  • Figure 3: Logarithm-scaled AMSME for our proposed misspecification estimation and the approach of Adewale and Wiens (2009) under (for rows top to bottom) misspecification Types 2c, 3c, 2d, 3d, 2e and 3e, under (columns left to right) linear, logistic and Poisson regression models.
  • Figure 4: Rows (top to bottom): Subsampling probabilities across $M=10$ simulations against covariate $x_1$ for misspecification Types 1, 2 and 3 under linear regression, respectively. Columns (left to right): models 1, 2, 3 and 4. Colours: probabilities computed from the large dataset in green and probabilities based on the subsample in red.
  • Figure 5: SML under (a) Type 1 - No misspecification, (b) Type 2 - Fixed misspecification and (c) Type 3 - Neighbourhood misspecification for the linear regression model under the subsampling methods: random, $A$-optimality, $L$-optimality, $L_1$-optimality, RLmAMSE and power or log odds function enhanced RLmAMSE.
  • ...and 8 more figures