Asymptotics of resampling without replacement in robust and logistic regression

Pierre C. Bellec, Takuya Koriyama

TL;DR

The work analyzes bagging estimators built from subsamples drawn without replacement in the high-dimensional proportional regime ($n/p=\delta$) for robust linear regression and logistic regression. A key contribution is a simple nonlinear fixed-point equation for the limiting cross-estimator correlation, $\eta=F(\eta)$, allowing the limiting bagged-risk to be expressed as $\sigma^2/M + (1-1/M)\sigma^2\eta$ and enabling data-driven estimation of $\eta$ and $\sigma^2$. The authors prove existence and uniqueness of the fixed point, establish convergence of pairwise inner products to $\eta\sigma^2$, and provide estimators that consistently recover these limits from overlaps among subsamples. Numerical simulations with Huber/pseudo-Huber losses and logistic loss validate the theory and show how subsample size $q$ can nontrivially affect risk, including potential U-shaped risk curves in certain regimes. These results offer practical guidance for tuning subsample sizes in bagging under high dimensionality and contribute a rigorous fixed-point framework for resampling without replacement in robust and GLM contexts.
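As a small illustration of the risk expression above, the following sketch evaluates $\sigma^2/M + (1-1/M)\sigma^2\eta$ for given inputs. The function name and the numeric values are assumptions made for illustration only; in practice $\sigma^2$ and $\eta$ would come from the fixed-point equation or from the data-driven estimators.

```python
def limiting_bagged_risk(sigma2: float, eta: float, M: int) -> float:
    """Evaluate sigma^2/M + (1 - 1/M) * sigma^2 * eta.

    sigma2 : limiting squared error of a single subsampled estimator
    eta    : limiting correlation between two estimators on different subsets
    M      : number of subsamples averaged in the bagged estimate
    """
    if M < 1:
        raise ValueError("M must be a positive integer")
    return sigma2 / M + (1.0 - 1.0 / M) * sigma2 * eta


# Illustrative values only (not from the paper).
for M in (1, 2, 10, 100):
    print(M, limiting_bagged_risk(sigma2=2.0, eta=0.5, M=M))
```

With $M=1$ the expression reduces to $\sigma^2$, and as $M$ grows it approaches $\sigma^2\eta$, so the benefit of bagging is governed by how far $\eta$ is below 1.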

Abstract

This paper studies the asymptotics of resampling without replacement in the proportional regime where dimension $p$ and sample size $n$ are of the same order. For a given dataset $(X,y)\in \mathbb{R}^{n\times p}\times \mathbb{R}^n$ and fixed subsample ratio $q\in(0,1)$, the practitioner samples, independently of $(X,y)$, iid subsets $I_1,\dots,I_M$ of $\{1,\dots,n\}$ of size $qn$ and trains estimators $\hat\beta(I_1),\dots,\hat\beta(I_M)$ on the corresponding subsets of rows of $(X, y)$. Understanding the performance of the bagged estimate $\bar\beta = \frac1M\sum_{m=1}^M \hat\beta(I_m)$, for instance its squared error, requires us to understand correlations between two distinct $\hat\beta(I_m)$ and $\hat\beta(I_{m'})$ trained on different subsets $I_m$ and $I_{m'}$. In robust linear regression and logistic regression, we characterize the limit in probability of the correlation between two estimates trained on different subsets of the data. The limit is characterized as the unique solution of a simple nonlinear equation. We further provide data-driven estimators that are consistent for estimating this limit. These estimators of the limiting correlation allow us to estimate the squared error of the bagged estimate $\bar\beta$, and for instance perform parameter tuning to choose the optimal subsample ratio $q$. As a by-product of the proof argument, we obtain the limiting distribution of the bivariate pair $(x_i^T \hat\beta(I_m), x_i^T \hat\beta(I_{m'}))$ for observations $i\in I_m\cap I_{m'}$, i.e., for observations used to train both estimates.
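The resampling scheme described in the abstract can be sketched as follows. This is a minimal standard-library illustration, not the authors' code: the helper names `draw_subsets` and `bag` are hypothetical, and the per-subset estimates passed to `bag` are assumed to come from whatever robust or logistic fitting routine is in use.

```python
import random


def draw_subsets(n, q, M, seed=0):
    """Draw M iid subsets of {0, ..., n-1}, each of size round(q * n),
    sampling without replacement within each subset."""
    rng = random.Random(seed)
    k = round(q * n)
    return [frozenset(rng.sample(range(n), k)) for _ in range(M)]


def bag(estimates):
    """Average M coefficient vectors (lists of equal length p) coordinate-wise."""
    M = len(estimates)
    p = len(estimates[0])
    return [sum(e[j] for e in estimates) / M for j in range(p)]


# Illustrative sizes only.
n, q, M = 1000, 0.5, 3
subsets = draw_subsets(n, q, M, seed=1)
overlap = len(subsets[0] & subsets[1])
print("subset size:", len(subsets[0]), "overlap of first two subsets:", overlap)
```

Two independent subsets of size $qn$ share about $q^2 n$ indices on average, and these shared observations are the ones for which the abstract describes the joint limit of $(x_i^T \hat\beta(I_m), x_i^T \hat\beta(I_{m'}))$.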

Paper Structure (26 sections, 16 theorems, 94 equations, 8 figures)

Key Result

Proposition 1

The function $F$ defined in \eqref{def-F} is non-decreasing and $q$-Lipschitz with $0\le F(0)\le q\le 1$. The equation $\eta = F(\eta)$ has a unique solution $\eta\in[0,q]$.
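Because $F$ is $q$-Lipschitz, it is a contraction whenever $q<1$, so plain fixed-point iteration converges to the unique solution at a geometric rate. The sketch below illustrates this with a toy map `toy_F`, which is an assumption chosen only because it satisfies the properties stated in Proposition 1 (non-decreasing, $q$-Lipschitz on $[0,1]$, $0\le F(0)\le q$). It is not the paper's $F$, whose definition is not reproduced in this summary.

```python
def toy_F(eta, q):
    """Hypothetical stand-in for F: non-decreasing on [0, 1], derivative
    q * eta <= q there, and F(0) = q / 2 lies in [0, q]."""
    return q * (0.5 + 0.5 * eta ** 2)


def solve_fixed_point(F, eta0=0.0, tol=1e-12, max_iter=10_000):
    """Iterate eta <- F(eta) until successive iterates differ by less than tol."""
    eta = eta0
    for _ in range(max_iter):
        nxt = F(eta)
        if abs(nxt - eta) < tol:
            return nxt
        eta = nxt
    raise RuntimeError("fixed-point iteration did not converge")


q = 0.5  # illustrative subsample ratio
eta = solve_fixed_point(lambda e: toy_F(e, q))
print("eta:", eta, "residual:", abs(toy_F(eta, q) - eta))
```

For the actual $F$, each evaluation would typically involve numerical integration, but the same iteration applies and the error shrinks by a factor of at most $q$ per step.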

Figures (8)

  • Figure 1: Plot of $q \mapsto \eta$ and $q \mapsto \sigma^2\eta$ obtained by solving \eqref{eta_equation_robust} numerically. Different noise distributions are given by $(\text{scale})\times \text{t-dist (df=2)}$, for scale $\in\{1, 1.5, 2, 5, 10\}$. The dashed line is the affine line $q\mapsto (q-\delta^{-1})/(1-\delta^{-1})$. The bottom plots zoom in on a specific region of the top plots.
  • Figure 2: Comparison of simulation results, theoretical curves obtained by solving \eqref{eta_equation_robust} numerically, and estimate constructed by \eqref{eq:thm_estimation}. Here, the noise distribution is fixed to $3\times \text{t-dist(df=2)}$ and $(n, p)=(5000, 1000)$.
  • Figure 3: Comparison of simulation results, theoretical curves obtained by solving \eqref{eta_equation_logi} numerically, and estimate constructed by \eqref{eq:thm_estimation}, with $(n,p)$ fixed to $(5000, 500)$.
  • Figure 4: Plot of $q \mapsto \eta$ and $q \mapsto \sigma^2\eta$ obtained by solving \eqref{eta_equation_robust} numerically. Different noise distributions are given by $(\text{scale})\times \text{t-dist (df=3)}$, for scale $\in\{1, 1.5, 2, 5, 10\}$. The dashed line is the affine line $q\mapsto (q-\delta^{-1})/(1-\delta^{-1})$. The bottom plots zoom in on a specific region of the top plots.
  • Figure 5: Comparison of simulation results, theoretical curves obtained by solving \eqref{eta_equation_robust} numerically, and estimate constructed by \eqref{eq:thm_estimation}, for the pseudo-Huber loss $\rho(x)=\sqrt{1+x^2}$. Here, the noise distribution is fixed to $4\times \text{t-dist(df=2)}$ and $(n, p)=(5000, 1000)$. Error bars show the standard deviation over $10$ Monte Carlo simulations.
  • ...and 3 more figures

Theorems & Definitions (26)

  • Remark 1
  • Proposition 1
  • proof
  • Theorem 2.1
  • Proposition 2
  • Theorem 3.1
  • Proposition 3
  • proof
  • Lemma 1
  • proof
  • ...and 16 more