Distributed Estimation and Inference for Semi-parametric Binary Response Models

Xi Chen, Wenbo Jing, Weidong Liu, Yichen Zhang

TL;DR

This paper addresses estimation and inference for a semiparametric binary response model in a distributed data setting without assuming a noise distribution, focusing on the maximum score framework, which is nondifferentiable and nonregular. It introduces two distributed approaches: Avg-SMSE, a one-shot divide-and-conquer method that smooths the objective to achieve faster rates under relaxed machine-count constraints, and mSMSE, a multiround iterative smoothing procedure that achieves the optimal nonparametric rate and quadratic convergence with explicit inference procedures. The authors establish asymptotic normality and bias-corrected confidence intervals for the estimators, extend the methods to data heterogeneity (covariate and coefficient shift), and adapt the framework to high-dimensional sparse settings via a Dantzig-selector-like update, supported by extensive simulations. Overall, the smoothing-based divide-and-conquer framework enables scalable, statistically efficient distributed estimation and inference for nondifferentiable semiparametric models, with practical implications for large-scale private or decentralized data analysis.

Abstract

The development of modern technology has enabled data collection of unprecedented size, which poses new challenges to many statistical estimation and inference problems. This paper studies the maximum score estimator of a semi-parametric binary choice model under a distributed computing environment without pre-specifying the noise distribution. An intuitive divide-and-conquer estimator is computationally expensive and restricted by a non-regular constraint on the number of machines, due to the highly non-smooth nature of the objective function. We propose (1) a one-shot divide-and-conquer estimator after smoothing the objective to relax the constraint, and (2) a multi-round estimator to completely remove the constraint via iterative smoothing. We specify an adaptive choice of kernel smoother with a sequentially shrinking bandwidth to achieve the superlinear improvement of the optimization error over the multiple iterations. The improved statistical accuracy per iteration is derived, and a quadratic convergence up to the optimal statistical error rate is established. We further provide two generalizations to handle the heterogeneity of datasets and high-dimensional problems where the parameter of interest is sparse.
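
The one-shot idea in (1) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a scalar parameter, a model of the form y = 1{x + z·beta + noise >= 0} with the coefficient on x normalized to 1, a logistic CDF as the smooth surrogate for the indicator, and a plain grid search on each machine. All function names, the bandwidth, and the simulated data are hypothetical choices made for illustration only.

```python
import numpy as np

def smooth_step(t):
    """Smooth surrogate for the indicator 1{t >= 0} (logistic CDF, an assumed choice)."""
    return 0.5 * (1.0 + np.tanh(0.5 * t))

def smoothed_score(beta, x, z, y, h):
    """Smoothed maximum score objective for a scalar beta and bandwidth h."""
    index = x + z * beta  # coefficient on x is normalized to 1 (assumption)
    return float(np.mean((2.0 * y - 1.0) * smooth_step(index / h)))

def local_estimate(x, z, y, h, grid):
    """Maximize the smoothed objective over a grid using one machine's data."""
    scores = [smoothed_score(b, x, z, y, h) for b in grid]
    return float(grid[int(np.argmax(scores))])

def simulate(n, beta_star, rng):
    """Simulated data; the estimator itself never uses the noise distribution."""
    x = rng.normal(size=n)
    z = rng.normal(size=n)
    eps = rng.normal(scale=0.5, size=n)
    y = (x + z * beta_star + eps >= 0).astype(float)
    return x, z, y

rng = np.random.default_rng(0)
beta_star, m, n, h = 1.0, 5, 4000, 0.1  # m machines, n samples per machine
grid = np.linspace(-1.0, 3.0, 401)

machines = [simulate(n, beta_star, rng) for _ in range(m)]
local_betas = [local_estimate(x, z, y, h, grid) for (x, z, y) in machines]

# One-shot divide-and-conquer: a single round of communication, then average.
avg_beta = float(np.mean(local_betas))
print("local estimates:", local_betas)
print("averaged estimate:", avg_beta)
```

Each machine sends only its local estimate, so the averaging step costs one round of communication. The bandwidth `h` controls a bias-variance trade-off in this sketch: a larger value gives a smoother objective but shifts the maximizer away from the true parameter.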

Paper Structure

This paper contains 37 sections, 4 theorems, 335 equations, 4 figures, 11 tables, 3 algorithms.

Key Result

Lemma B.1

Under Assumptions A1--A5, if $h=o\left(1\right)$, $\left\lVert\bm{\beta}-\bm{\beta}^*\right\rVert_2 \leq \delta$, $\delta=o\left(\min\{h^{\alpha/2}, h/\sqrt{p\log n}\}\right)$ then and for any $\bm{v} \in \mathbb{R}^p \setminus\{\bm{0}\}$.

Figures (4)

  • Figure 1: The coverage rates of different methods with homoscedastic normal noise.
  • Figure 2: The CPU time (in seconds) of (mSMSE) as we increase $m$, $L$, $p$ in subfigures (a), (b), (c), respectively.
  • Figure 3: Coverage rates for different methods with $p=20$ and homoscedastic normal noise.
  • Figure 4: The $L_2$ estimation errors of different methods in the high-dimensional setting.

Theorems & Definitions (28)

  • Remark 1
  • Remark 2
  • Remark 3
  • Remark 4: Superefficiency phenomenon
  • Remark 5
  • Remark 6
  • Remark 7
  • Remark 8
  • Remark 9
  • Proof of Proposition \ref{thm:1step-ld}
  • ...and 18 more