Asymptotics of resampling without replacement in robust and logistic regression

Pierre C. Bellec; Takuya Koriyama

TL;DR

The work analyzes bagging estimators built from subsamples drawn without replacement in the high-dimensional proportional regime ($n/p=\delta$) for robust linear regression and logistic regression. A key contribution is a simple nonlinear fixed-point equation, $\eta=F(\eta)$, for the limiting correlation $\eta$ between estimators trained on different subsamples. With $M$ subsamples, the limiting risk of the bagged estimator can then be written as $\sigma^2/M + (1-1/M)\sigma^2\eta$, and both $\eta$ and $\sigma^2$ can be estimated from data. The authors prove existence and uniqueness of the fixed point, establish convergence of pairwise inner products to $\eta\sigma^2$, and provide estimators that consistently recover these limits from overlaps among subsamples. Numerical simulations with Huber/pseudo-Huber losses and logistic loss validate the theory and show that the subsample ratio $q$ can affect the risk non-monotonically, including U-shaped risk curves in certain regimes. These results offer practical guidance for tuning subsample sizes in bagging under high dimensionality and contribute a rigorous fixed-point framework for resampling without replacement in robust and GLM contexts.
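
The risk formula above can be recovered by expanding the squared error of the average. The following is a sketch rather than the paper's argument: $\beta^*$ is notation introduced here for the true coefficient vector, and the sketch assumes that each single-subsample estimator has limiting squared error $\sigma^2$ (which is what the formula gives at $M=1$) and that each pair of distinct estimators has limiting inner product $\eta\sigma^2$.

```latex
\begin{align*}
\|\bar\beta-\beta^*\|^2
 &= \frac{1}{M^2}\sum_{m=1}^M \|\hat\beta(I_m)-\beta^*\|^2
  + \frac{1}{M^2}\sum_{m\neq m'} \bigl(\hat\beta(I_m)-\beta^*\bigr)^\top \bigl(\hat\beta(I_{m'})-\beta^*\bigr) \\
 &\to \frac{M}{M^2}\,\sigma^2 + \frac{M(M-1)}{M^2}\,\eta\sigma^2
  = \frac{\sigma^2}{M} + \Bigl(1-\frac{1}{M}\Bigr)\sigma^2\eta .
\end{align*}
```

In particular, as $M\to\infty$ the limiting risk tends to $\sigma^2\eta$, so the correlation $\eta$ determines how much bagging can reduce the risk of a single estimator.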

Abstract

This paper studies the asymptotics of resampling without replacement in the proportional regime where dimension $p$ and sample size $n$ are of the same order. For a given dataset $(X,y)\in \mathbb{R}^{n\times p}\times \mathbb{R}^n$ and fixed subsample ratio $q\in(0,1)$, the practitioner samples independently of $(X,y)$ iid subsets $I_1,\dots,I_M$ of $\{1,\dots,n\}$ of size $q n$ and trains estimators $\hat\beta(I_1),\dots,\hat\beta(I_M)$ on the corresponding subsets of rows of $(X, y)$. Understanding the performance of the bagged estimate $\bar\beta = \frac1M\sum_{m=1}^M \hat\beta(I_m)$, for instance its squared error, requires us to understand correlations between two distinct $\hat\beta(I_m)$ and $\hat\beta(I_{m'})$ trained on different subsets $I_m$ and $I_{m'}$. In robust linear regression and logistic regression, we characterize the limit in probability of the correlation between two estimates trained on different subsets of the data. The limit is characterized as the unique solution of a simple nonlinear equation. We further provide data-driven estimators that are consistent for estimating this limit. These estimators of the limiting correlation allow us to estimate the squared error of the bagged estimate $\bar\beta$, and for instance perform parameter tuning to choose the optimal subsample ratio $q$. As a by-product of the proof argument, we obtain the limiting distribution of the bivariate pair $(x_i^T \hat\beta(I_m), x_i^T \hat\beta(I_{m'}))$ for observations $i\in I_m\cap I_{m'}$, i.e., for observations used to train both estimates.
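
The resampling scheme described in the abstract can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the subset size is taken as `round(q * n)`, and ordinary least squares is used only as a stand-in for the robust and logistic M-estimators that the paper actually studies.

```python
import numpy as np


def draw_subsets(n, q, M, rng):
    """Draw M iid subsets of {0, ..., n-1}, each of size round(q * n).

    Indices are sampled without replacement within each subset; different
    subsets are independent and may overlap.
    """
    k = int(round(q * n))
    return [rng.choice(n, size=k, replace=False) for _ in range(M)]


def least_squares_fit(X, y):
    # Stand-in estimator (assumption): swap in a Huber or logistic
    # M-estimator to match the setting of the paper.
    return np.linalg.lstsq(X, y, rcond=None)[0]


def bagged_estimate(X, y, q, M, rng, fit=least_squares_fit):
    """Train one estimator per subset and average them.

    Returns the bagged estimate, the (M, p) array of individual
    estimates, and the list of index subsets.
    """
    subsets = draw_subsets(X.shape[0], q, M, rng)
    betas = np.stack([fit(X[I], y[I]) for I in subsets])
    return betas.mean(axis=0), betas, subsets
```

Returning the individual estimates and the subsets alongside the average keeps the overlaps $I_m\cap I_{m'}$ available, which is the information the abstract says the correlation analysis relies on.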

Paper Structure (26 sections, 16 theorems, 94 equations, 8 figures)

This paper contains 26 sections, 16 theorems, 94 equations, 8 figures.

Introduction
M-estimation in the proportional regime
Bagging estimators trained on subsampled datasets without replacement
Related work
Robust regression
A review of existing results in robust linear regression
A glance at our results
Existence and uniqueness of solutions to the fixed-point equation
Main results in robust regression
Numerical simulations in robust regression
Resampling without replacement in logistic regression
A review of existing results in logistic regression
Main results for logistic regression
Numerical simulations in logistic regression
Proof of the main results
...and 11 more sections

Key Result

Proposition 1

The function $F$ in \eqref{def-F} is non-decreasing and $q$-Lipschitz with $0\le F(0)\le q\le 1$. The equation $\eta = F(\eta)$ has a unique solution $\eta\in[0,q]$.
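
When $q<1$, a $q$-Lipschitz map is a contraction, so the fixed point in Proposition 1 can be approximated by plain iteration $\eta_{k+1}=F(\eta_k)$. The sketch below illustrates this in Python. The paper's actual $F$ is not reproduced on this page, so `toy_F` is a hypothetical affine stand-in chosen only because it is non-decreasing, $q$-Lipschitz, satisfies $0\le F(0)\le q$, and has a closed-form fixed point to check against.

```python
def solve_fixed_point(F, tol=1e-12, max_iter=10_000):
    """Iterate eta <- F(eta) from eta = 0 until successive values agree.

    For a contraction (Lipschitz constant below 1) the iterates converge
    geometrically to the unique fixed point.
    """
    eta = 0.0
    for _ in range(max_iter):
        new_eta = F(eta)
        if abs(new_eta - eta) <= tol:
            return new_eta
        eta = new_eta
    raise RuntimeError("fixed-point iteration did not converge")


def toy_F(eta, q, a):
    # Hypothetical stand-in for the paper's F (assumption), with 0 <= a <= 1:
    # non-decreasing, Lipschitz constant q * (1 - a) <= q, and F(0) = q * a <= q.
    return q * (a + (1.0 - a) * eta)
```

For this toy map the fixed point is $qa/(1-q(1-a))$, which lies in $[0,q]$; for example, `solve_fixed_point(lambda e: toy_F(e, 0.5, 0.3))` should agree with $0.15/0.65$ up to the tolerance.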

Figures (8)

Figure 1: Plot of $q \mapsto \eta$ and $q\mapsto \sigma^2\eta$ obtained by solving \eqref{eta_equation_robust} numerically. Different noise distributions are given by $(\text{scale})\times \text{t-dist (df=2)}$, for scale $\in\{1, 1.5, 2, 5, 10\}$. The dashed line is the affine line $q\mapsto (q-\delta^{-1})/(1-\delta^{-1})$. The bottom plots zoom in on a specific region of the top plots.
Figure 2: Comparison of simulation results, theoretical curves obtained by solving \eqref{eta_equation_robust} numerically, and the estimate constructed by \eqref{eq:thm_estimation}. Here, the noise distribution is fixed to $3\times \text{t-dist (df=2)}$ and $(n, p)=(5000, 1000)$.
Figure 3: Comparison of simulation results, theoretical curves obtained by solving \eqref{eta_equation_logi} numerically, and the estimate constructed by \eqref{eq:thm_estimation}, with $(n,p)$ fixed to $(5000, 500)$.
Figure 4: Plot of $q \mapsto \eta$ and $q\mapsto \sigma^2\eta$ obtained by solving \eqref{eta_equation_robust} numerically. Different noise distributions are given by $(\text{scale})\times \text{t-dist (df=3)}$, for scale $\in\{1, 1.5, 2, 5, 10\}$. The dashed line is the affine line $q\mapsto (q-\delta^{-1})/(1-\delta^{-1})$. The bottom plots zoom in on a specific region of the top plots.
Figure 5: Comparison of simulation results, theoretical curves obtained by solving \eqref{eta_equation_robust} numerically, and the estimate constructed by \eqref{eq:thm_estimation}, for the pseudo-Huber loss $\rho(x)=\sqrt{1+x^2}$. Here, the noise distribution is fixed to $4\times \text{t-dist (df=2)}$ and $(n, p)=(5000, 1000)$. Error bars show the standard deviation over $10$ Monte Carlo simulations.
...and 3 more figures

Theorems & Definitions (26)

Remark 1
Proposition 1
proof
Theorem 2.1
Proposition 2
Theorem 3.1
Proposition 3
proof
Lemma 1
proof
...and 16 more
