Optimal convex $M$-estimation via score matching

Oliver Y. Feng; Yu-Chun Kao; Min Xu; Richard J. Samworth

TL;DR

The paper develops a data-driven approach to convex $M$-estimation in linear regression, using score matching with respect to the Fisher divergence to obtain the best decreasing approximation to the score function, so that the associated loss is convex. The key idea is the antitonic score projection, which yields a population-optimal score $\psi_0^*$ and its corresponding convex loss $\ell_0^*$ through a log-concave Fisher divergence projection, even when the error density is not log-concave. Semiparametric estimation is achieved via an alternating procedure that estimates $\beta$ and the projected score from residuals, with three-fold cross-fitting ensuring $\sqrt{n}$-consistency and asymptotic normality that attains an antitonic efficiency lower bound $i^*(p_0)$. In heavy-tailed scenarios such as Cauchy errors, the resulting Huber-like loss $\ell_0^*$ provides substantial robustness with minimal loss of efficiency (ARE$^*$ near 0.88), and numerical experiments with the R package 'asm' corroborate both accuracy and computational efficiency. Overall, the framework unites shape-constrained estimation, Fisher-information-inspired projections, and robust convex optimization to deliver practically efficient, statistically near-optimal linear regression under unknown error distributions.

Abstract

In the context of linear regression, we construct a data-driven convex loss function with respect to which empirical risk minimisation yields optimal asymptotic variance in the downstream estimation of the regression coefficients. At the population level, the negative derivative of the optimal convex loss is the best decreasing approximation of the derivative of the log-density of the noise distribution. This motivates a fitting process via a nonparametric extension of score matching, corresponding to a log-concave projection of the noise distribution with respect to the Fisher divergence. At the sample level, our semiparametric estimator is computationally efficient, and we prove that it attains the minimal asymptotic covariance among all convex $M$-estimators. As an example of a non-log-concave setting, the optimal convex loss function for Cauchy errors is Huber-like, and our procedure yields asymptotic efficiency greater than $0.87$ relative to the maximum likelihood estimator of the regression coefficients that uses oracle knowledge of this error distribution. In this sense, we provide robustness and facilitate computation without sacrificing much statistical efficiency. Numerical experiments using our accompanying R package 'asm' confirm the practical merits of our proposal.
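
The link between score matching and asymptotic variance can be made concrete with a short worked derivation. This is only a sketch: the symbols $\varepsilon$, $\psi$, $V_{p_0}$ and $D_{p_0}$ are shorthand assumed here for illustration, and we assume enough regularity for all derivatives and expectations to exist. Write $\varepsilon \sim P_0$ for the noise and $\psi = -\ell'$ for the negative derivative of a convex loss $\ell$, so that $\psi$ is decreasing. The scalar factor in the asymptotic variance of the corresponding $M$-estimator and the score matching objective are

$$
V_{p_0}(\psi) = \frac{\mathbb{E}\,\psi^2(\varepsilon)}{\bigl(\mathbb{E}\,\psi'(\varepsilon)\bigr)^2},
\qquad
D_{p_0}(\psi) = \mathbb{E}\,\psi^2(\varepsilon) + 2\,\mathbb{E}\,\psi'(\varepsilon).
$$

Rescaling $\psi$ by a constant $c \ge 0$ leaves $V_{p_0}$ unchanged but not $D_{p_0}$, and optimising over the scale gives

$$
\inf_{c \ge 0} D_{p_0}(c\psi)
= \inf_{c \ge 0} \bigl\{ c^2\,\mathbb{E}\,\psi^2(\varepsilon) + 2c\,\mathbb{E}\,\psi'(\varepsilon) \bigr\}
= -\frac{\bigl(\mathbb{E}\,\psi'(\varepsilon)\bigr)^2}{\mathbb{E}\,\psi^2(\varepsilon)}
= -\frac{1}{V_{p_0}(\psi)},
$$

with the infimum attained at $c = -\mathbb{E}\,\psi'(\varepsilon) / \mathbb{E}\,\psi^2(\varepsilon) \ge 0$, which is non-negative because $\psi' \le 0$. Since the set of decreasing functions is closed under non-negative scaling, minimising the score matching objective over decreasing $\psi$ amounts to minimising the asymptotic variance factor over convex losses.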

Paper Structure (31 sections, 46 theorems, 382 equations, 14 figures, 5 tables)

Introduction
Related work
Notation
The antitonic score projection
Construction and basic properties
The log-concave Fisher divergence projection
Examples
Semiparametric M-estimation via antitonic score matching
Warm-up: Estimation of the projected score function from direct observations
Linear regression: alternating algorithm outline
Linear regression with symmetric errors
Linear regression with an intercept term
Inference
Numerical experiments
Estimation accuracy
...and 16 more sections

Key Result

Lemma 1

Let $P_0$ be a distribution with a uniformly continuous density $p_0$ on $\mathbb{R}$. Let $F_0 \colon [-\infty,\infty] \to [0,1]$ be the corresponding distribution function, and for $u \in [0,1]$, define the density quantile function $J_0(u) := p_0\bigl(F_0^{-1}(u)\bigr)$, where $F_0^{-1}$ denotes the quantile function of $P_0$. Then both $J_0$ and its least concave majorant $\hat{J}_0$ on $[0,1]$ are continuous, with $p_0 = J_0 \circ F_0$ on $\mathbb{R}$, and the projected score $\psi_0^*$ is decreasing and right-continuous as a function from $\mathbb{R}$ ...
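
As a rough numerical illustration of the objects in Lemma 1, the Python sketch below takes $P_0$ to be the standard Cauchy distribution, for which the density quantile function has the closed form $J_0(u) = \sin^2(\pi u)/\pi$. It computes the least concave majorant $\hat{J}_0$ on a grid via an upper convex hull and then evaluates $\int_0^1 (\hat{J}_0')^2$. Treating this integral as the antitonic information $i^*(p_0)$, and its ratio to the Cauchy Fisher information $1/2$ as the relative efficiency, is an assumption made for this sketch; the helper `least_concave_majorant`, the grid size and all variable names are hypothetical and are not taken from the 'asm' package.

```python
import numpy as np


def least_concave_majorant(u, y):
    """Least concave majorant of the points (u[i], y[i]), evaluated at u.

    Assumes u is strictly increasing. Builds the upper convex hull with a
    monotone-chain scan, then interpolates linearly between hull vertices.
    """
    hull = []  # indices of the upper-hull vertices, left to right
    for i in range(len(u)):
        while len(hull) >= 2:
            a, b = hull[-2], hull[-1]
            # cross >= 0 means vertex b lies on or below the chord from a to i
            cross = (u[b] - u[a]) * (y[i] - y[a]) - (y[b] - y[a]) * (u[i] - u[a])
            if cross >= 0:
                hull.pop()
            else:
                break
        hull.append(i)
    return np.interp(u, u[hull], y[hull])


# Density quantile function of the standard Cauchy distribution on a grid.
u = np.linspace(0.0, 1.0, 20001)
J0 = np.sin(np.pi * u) ** 2 / np.pi

# Least concave majorant and its slope on each grid cell.
J0_hat = least_concave_majorant(u, J0)
slopes = np.diff(J0_hat) / np.diff(u)

# Assumed reading: i*(p0) is the integral of the squared slope over [0, 1].
i_star = np.sum(slopes ** 2 * np.diff(u))

# Fisher information of the standard Cauchy location model is 1/2.
are_star = i_star / 0.5

# The constant slope near u = 0 is the level at which the projected score
# flattens out in the left tail (the Huber-like behaviour).
plateau = slopes[0]

print(f"i* ~ {i_star:.4f}, ARE* ~ {are_star:.4f}, plateau ~ {plateau:.4f}")
```

Under these assumptions the computed ratio should land close to the efficiency figure quoted in the TL;DR. The slopes are constant near both ends of $[0,1]$ and follow $J_0'$ in between, so the projected score is flat in both tails and agrees with the Cauchy score in the middle.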

Figures (14)

Figure 1: Top row: Plots of the score function $\psi_0$ (green) and projected score function $\psi_0^*$ (blue); Bottom row: their respective negative antiderivatives, namely the negative log-density $-\log p_0$ (green) and optimal convex loss function $\ell_0^*$ (blue), for each of the following non-log-concave distributions (from left to right): (a) Student's $t_2$; (b) symmetrised Pareto density (\ref{eq:pareto-sym}) with $\sigma = 2$ and $\alpha = 3$; (c) Gaussian mixture $0.4 N(-2,1) + 0.6 N(2,1)$.
Figure 2: Left: The density quantile function $J_0$ and its least concave majorant $\hat{J}_0$ for a standard Cauchy density. Right: The corresponding score functions $\psi_0$ and $\psi_0^*$.
Figure 3: Illustration of the construction in the proof of Proposition \ref{prop:Vp0-MLE}. Left: Plot of the density $p_0$ (black) together with its log-concave maximum likelihood projection $p_0^{\mathrm{ML}}$ (red) and Fisher divergence projection $p_0^*$ (blue). Right: Plot of the corresponding score functions.
Figure 4: Left: The negative log-density $\ell_0 = -\log p_0$ and the optimal convex loss function $\ell_0^* = -\log p_0^*$ when $p_0$ is the standard Cauchy density. Right: The corresponding densities $p_0$ and $p_0^*$.
Figure 5: Plot of the asymptotic relative efficiency $r(K)$ of the Huber $M$-estimator $\hat{\beta}_{\psi_K}$ compared with the optimal convex $M$-estimator.
...and 9 more figures

Theorems & Definitions (97)

Lemma 1
Theorem 2
Remark 3
Corollary 4
Lemma 5
Lemma 6
Proposition 7
Lemma 8
Remark 9
Proposition 10
...and 87 more
