On Efficient and Scalable Computation of the Nonparametric Maximum Likelihood Estimator in Mixture Models

Yangjing Zhang; Ying Cui; Bodhisattva Sen; Kim-Chuan Toh

On Efficient and Scalable Computation of the Nonparametric Maximum Likelihood Estimator in Mixture Models

Yangjing Zhang, Ying Cui, Bodhisattva Sen, Kim-Chuan Toh

TL;DR

This paper proposes, leveraging the sparsity of the solution, an efficient and scalable semismooth Newton based augmented Lagrangian method (ALM), which beats the state-of-the-art methods and proposes new denoising estimands in this context along with their consistent estimates.

Abstract

In this paper we study the computation of the nonparametric maximum likelihood estimator (NPMLE) in multivariate mixture models. Our first approach discretizes this infinite dimensional convex optimization problem by fixing the support points of the NPMLE and optimizing over the mixture proportions. In this context we propose, leveraging the sparsity of the solution, an efficient and scalable semismooth Newton based augmented Lagrangian method (ALM). Our algorithm beats the state-of-the-art methods~\cite{koenker2017rebayes, kim2020fast} and can handle $n \approx 10^6$ data points with $m \approx 10^4$ support points. Our second procedure, which combines the expectation-maximization (EM) algorithm with the ALM approach above, allows for joint optimization of both the support points and the probability weights. For both our algorithms we provide formal results on their (superlinear) convergence properties. The computed NPMLE can be immediately used for denoising the observations in the framework of empirical Bayes. We propose new denoising estimands in this context along with their consistent estimates. Extensive numerical experiments are conducted to illustrate the effectiveness of our methods. In particular, we employ our procedures to analyze two astronomy data sets: (i) Gaia-TGAS Catalog~\cite{anderson2018improving} containing $n \approx 1.4 \times 10^6$ data points in two dimensions, and (ii) the $d=19$ dimensional data set from the APOGEE survey~\cite{majewski2017apache} with $n \approx 2.7 \times 10^4$.

On Efficient and Scalable Computation of the Nonparametric Maximum Likelihood Estimator in Mixture Models

TL;DR

Abstract

data points with

support points. Our second procedure, which combines the expectation-maximization (EM) algorithm with the ALM approach above, allows for joint optimization of both the support points and the probability weights. For both our algorithms we provide formal results on their (superlinear) convergence properties. The computed NPMLE can be immediately used for denoising the observations in the framework of empirical Bayes. We propose new denoising estimands in this context along with their consistent estimates. Extensive numerical experiments are conducted to illustrate the effectiveness of our methods. In particular, we employ our procedures to analyze two astronomy data sets: (i) Gaia-TGAS Catalog~\cite{anderson2018improving} containing

data points in two dimensions, and (ii) the

dimensional data set from the APOGEE survey~\cite{majewski2017apache} with

Paper Structure (25 sections, 6 theorems, 82 equations, 16 figures, 6 tables, 1 algorithm)

This paper contains 25 sections, 6 theorems, 82 equations, 16 figures, 6 tables, 1 algorithm.

Introduction
An augmented Lagrangian method for solving \ref{['primal-0']}
Preliminaries
An augmented Lagrangian method for the dual of \ref{['primal-0']}
Semismooth Newton method for the ALM subproblem
Convergence results for the ALM and the semismooth Newton method
Comparison of the computational cost with other second order methods
The partial EM algorithm for solving \ref{['adaptive supports 0']}
Denoising via optimal transport
NPMLE based estimation of the OT based estimands
Numerical experiments
Initialization and stopping criteria of the methods
Stopping conditions
Implementation details of our ALM algorithm
Numerical results when $d=1$
...and 10 more sections

Key Result

Proposition 1

Let $\{\sigma_k\}_{k\geq 0}$ be a nondecreasing positive sequence converging to $\sigma_\infty \leq \infty$. Let $\{(u^k, v^k, x^k, y^k)\}_{k\geq 1}$ be the sequence generated by the ALM with each subproblem satisfying the stopping criterion (S1). Then the primal sequence $\{(x^k, y^k)\}_{k\geq 1}$

Figures (16)

Figure 1: Left: The noisy CMD corresponding to data $\{Y_i\}_{i=1}^n \subset \mathbb{R}^2$ for $n \approx 1.4 \times 10^6$ stars obtained from the Gaia-TGAS Catalog. Middle: The denoised CMD using empirical Bayes estimates $\{\widehat{\theta}_i\}_{i=1}^n$ based on the NPMLE computed via the augmented Lagrangian method (see Section \ref{['sec:ALM0']}). Right: The denoised CMD computed via the partial expectation maximization algorithm (see Section \ref{['sec:adaptive supports']}). Both the denoised CMDs have rather sharp tails in the bottom of the plots (i.e., the main sequence) and the top right (i.e., the tip of the red-giant branch) as well as a definitive cluster in the center-right (i.e., the red clump).
Figure 2: The distribution of the singular values $\sigma_i(L)$ of $L$ computed from (left) the APOGEE data (here $n=27,135$, and $m=1,000$) and (right) the synthetic Example 3(a) (here $n=5,000$, $m=1,000$) as $d$ varies. The top $20$ singular values are excluded so that the others are not overshadowed. Observe the slow decay of the singular values when $d >1$.
Figure 3: Plots of the raw data (in blue) with $n= 5,000$ in $d=2$, the corresponding empirical Bayes estimates (in red), the true $G^*$ (in black), and $\widehat{G}_n$ (in black dots) obtained from our ALM. Here half of the true signals $\theta_i \in \mathbb{R}^d$ are drawn uniformly at random from each of the two concentric circles of radii 2 and 6 respectively (centered at $(0,0) \in \mathbb{R}^2$), and $Y_i \mid \theta_i \sim {\cal N}(\theta_i, I_2)$ for $i=1,\dots,n$. Observe that some of the empirical Bayes estimates $\widehat{\theta}_i$'s are far from the support of $G^*$ (and $\widehat{G}_n$).
Figure 4: Computational time (in seconds) of the ALM, mix-SQP, and REBayes with changing $n$ and $m$ for Example 2 (averaged over $10$ replications).
Figure 5: Results for the $d=2$ dimensional Gaia-TGAS data obtained from: (a) mix-SQP (with $m = 20^2$), (b) ALM (with $m = 100^2$). The number of support points of $\widehat{G}_n$ (see the (2,2) subplots) are: (a) $307$, (b) $1,677$, where the size of each support point plotted is proportional to its weight. The run times are: (a) 28 minutes (failed to converge within 100 iterations), (b) 472 minutes. With denser grid points we are able to obtain sharper denoised estimates that reveal finer details of the CMD.
...and 11 more figures

Theorems & Definitions (11)

Proposition 1
Proposition 2
Proposition 3
Proposition 4
Theorem 1
proof
proof
Definition 1: Push-forward
Definition 2: Monge's problem
Definition 3: Kantorovich's problem
...and 1 more

On Efficient and Scalable Computation of the Nonparametric Maximum Likelihood Estimator in Mixture Models

TL;DR

Abstract

On Efficient and Scalable Computation of the Nonparametric Maximum Likelihood Estimator in Mixture Models

Authors

TL;DR

Abstract

Table of Contents

Key Result

Figures (16)

Theorems & Definitions (11)