Proximal Iteration for Nonlinear Adaptive Lasso

Nathan Wycoff, Lisa O. Singh, Ali Arab, Katharine M. Donato

TL;DR

This work develops a unified proximal-gradient framework to debias and structure-sparsify nonlinear models by learning penalty coefficients $\boldsymbol{\lambda}$ jointly with model parameters $\boldsymbol{\beta}$ in an Adaptive Lasso setting. By introducing novel proximal operators for the variable-penalty $\ell_1$ term and its log-regularized form, it enables efficient optimization for general differentiable losses and arbitrary sparsity structures. The authors establish convergence guarantees under a global-descent regime, derive asymptotic properties with diffuse priors, and demonstrate oracle-like performance under appropriate priors. Large-scale experiments across non-Gaussian regression, as well as real-world case studies in vaccination behavior and international migration, show competitive speed and improved accuracy, highlighting practical applicability to complex, high-dimensional problems.

Abstract

Augmenting a smooth cost function with an $\ell_1$ penalty allows analysts to conduct estimation and variable selection simultaneously in sophisticated models, and can be implemented efficiently using proximal gradient methods. However, one drawback of the $\ell_1$ penalty is bias: nonzero parameters are underestimated in magnitude, motivating techniques such as the Adaptive Lasso, which endow each parameter with its own penalty coefficient. It is not clear, however, how these parameter-specific penalties should be set in complex models. In this article, we study the approach of treating the penalty coefficients as additional decision variables to be learned in a maximum a posteriori (MAP) manner, developing a proximal gradient approach to joint optimization of these together with the parameters of any differentiable cost function. Beyond reducing bias in estimates, this procedure can also encourage arbitrary sparsity structure via a prior on the penalty coefficients. We compare our method to implementations of specific sparsity structures for non-Gaussian regression on synthetic and real datasets, finding our more general method to be competitive in terms of both speed and accuracy. We then consider nonlinear models for two case studies: COVID-19 vaccination behavior and international refugee movement, highlighting the applicability of this approach to complex problems and intricate sparsity structures.
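To make the bias mentioned in the abstract concrete, here is a minimal sketch of plain proximal gradient descent with a single, fixed $\ell_1$ penalty. It is not the authors' method: the names `soft_threshold` and `proximal_gradient`, the toy quadratic cost, and the values of `lam` and `step` are all illustrative assumptions.

```python
def soft_threshold(x, t):
    """Proximal operator of t * |.|: shrink x toward zero by t."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def proximal_gradient(grad, beta0, lam, step, n_iter=200):
    """Minimize f(beta) + lam * sum_j |beta_j| for a smooth f with gradient `grad`."""
    beta = list(beta0)
    for _ in range(n_iter):
        g = grad(beta)
        # Gradient step on the smooth part, then the l1 proximal step.
        beta = [soft_threshold(b - step * gj, step * lam) for b, gj in zip(beta, g)]
    return beta


# Toy smooth cost (an assumption for illustration):
# f(beta) = 0.5 * sum_j (beta_j - y_j)^2, so grad f(beta) = beta - y.
y = [3.0, 0.2, -1.5]


def grad(beta):
    return [b - yj for b, yj in zip(beta, y)]


beta_hat = proximal_gradient(grad, [0.0, 0.0, 0.0], lam=0.5, step=1.0)
print(beta_hat)
```

In this toy problem the small coefficient is set exactly to zero (variable selection), while the two large coefficients are each pulled toward zero by `lam`. That shrinkage of the surviving coefficients is the bias that parameter-specific penalties aim to reduce.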

Paper Structure

This paper contains 37 sections, 13 theorems, 29 equations, 9 figures, 1 table, 2 algorithms.

Key Result

Lemma 1

The marginal cost of the proximal problem (eq:prox_prob) with respect to $\lambda$ (i.e. with $\beta$ profiled out) is a piecewise quadratic expression with a changepoint at $\lambda=\frac{|\beta_0|}{s_\beta}$, the point where $\lambda$ is just large enough to push $\beta$ to zero.
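As a hedged illustration of how such a piecewise quadratic can arise, suppose the proximal problem has the standard form below. This form and the constraint $\lambda \ge 0$ are assumptions, not taken from the source; they are chosen because they reproduce the stated changepoint.

```latex
% Assumed form of the proximal problem (not given in the source):
\min_{\beta,\;\lambda \ge 0}\;
  \lambda|\beta| + \frac{(\beta-\beta_0)^2}{2 s_\beta}
                 + \frac{(\lambda-\lambda_0)^2}{2 s_\lambda}

% For fixed \lambda, the minimizer in \beta is soft thresholding:
\beta^\star(\lambda) = \operatorname{sign}(\beta_0)\,
  \max\{|\beta_0| - s_\beta\lambda,\; 0\}

% Substituting \beta^\star(\lambda) gives the marginal cost in \lambda:
h(\lambda) =
\begin{cases}
  \lambda|\beta_0| - \dfrac{s_\beta}{2}\lambda^2
    + \dfrac{(\lambda-\lambda_0)^2}{2 s_\lambda},
    & 0 \le \lambda \le \dfrac{|\beta_0|}{s_\beta}, \\[2ex]
  \dfrac{\beta_0^2}{2 s_\beta}
    + \dfrac{(\lambda-\lambda_0)^2}{2 s_\lambda},
    & \lambda \ge \dfrac{|\beta_0|}{s_\beta}.
\end{cases}
```

Under this assumption the two pieces agree at the changepoint, where both equal $\frac{\beta_0^2}{2 s_\beta}$. The first piece has curvature $\frac{1}{s_\lambda}-s_\beta$, so it is convex exactly when $s_\beta s_\lambda<1$.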

Figures (9)

  • Figure 1: Left: The function $g(\beta,\lambda)=\lambda|\beta|$. Center and right: The proximal cost of $g$ as a function of $\beta$ and $\lambda$ (center) and its marginal in $\lambda$ (right), with $\lambda_0 = \beta_0=1; s_\lambda=s_\beta = 2$.
  • Figure 2: The Action of the Proximal Operator: Plots of the reduced proximal operator (Eq. eq:reduced_prox) for various fixed $b:=s_x s_\lambda<1$ and with $\lambda_0,\frac{|x_0|}{s_x} \in (0,2)$. Values $b=s_x s_\lambda\in\{0.1,0.35,0.65,0.99\}$ are shown left to right.
  • Figure 3: Comparison on synthetic data with independent sparsity.
  • Figure 4: Comparison on synthetic data with group sparsity.
  • Figure 5: Comparison on synthetic data with hierarchical sparsity.
  • ...and 4 more figures
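
The proximal cost shown in Figures 1 and 2 can be explored numerically. The sketch below is not the authors' operator. It assumes the proximal problem is $\min_{\beta,\lambda\ge 0}\ \lambda|\beta| + \frac{(\beta-\beta_0)^2}{2s_\beta} + \frac{(\lambda-\lambda_0)^2}{2s_\lambda}$, profiles out $\beta$ by soft thresholding, and then minimizes the resulting piecewise quadratic in $\lambda$ by comparing a handful of candidate points. The function names `marginal_cost` and `joint_prox` are hypothetical.

```python
def soft_threshold(x, t):
    """Shrink x toward zero by t (proximal operator of t * |.|)."""
    sign = 1.0 if x >= 0 else -1.0
    return sign * max(abs(x) - t, 0.0)


def marginal_cost(lam, beta0, lam0, s_beta, s_lam):
    """Assumed proximal cost with beta profiled out at the given lam."""
    beta = soft_threshold(beta0, s_beta * lam)
    return (lam * abs(beta)
            + (beta - beta0) ** 2 / (2 * s_beta)
            + (lam - lam0) ** 2 / (2 * s_lam))


def joint_prox(beta0, lam0, s_beta, s_lam):
    """Minimize the assumed joint proximal cost over beta and lam >= 0."""
    c = abs(beta0) / s_beta  # changepoint: beta is zero for lam >= c
    # Endpoints of the first piece, plus the minimizer of the second piece.
    candidates = [0.0, c, max(lam0, c)]
    b = s_beta * s_lam
    if b != 1.0:
        # Stationary point of the first quadratic piece, if it lies in [0, c].
        stat = (lam0 - s_lam * abs(beta0)) / (1.0 - b)
        if 0.0 <= stat <= c:
            candidates.append(stat)
    lam = min(candidates,
              key=lambda l: marginal_cost(l, beta0, lam0, s_beta, s_lam))
    return soft_threshold(beta0, s_beta * lam), lam


beta_new, lam_new = joint_prox(beta0=1.0, lam0=1.0, s_beta=0.5, s_lam=0.5)
print(beta_new, lam_new)
```

Comparing candidates rather than trusting a single closed form keeps the sketch valid when $s_\beta s_\lambda \ge 1$, where the first piece is no longer convex, as with the Figure 1 settings $s_\lambda=s_\beta=2$.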

Theorems & Definitions (18)

  • Lemma 1
  • Theorem 2
  • Remark 3
  • Remark 4
  • Theorem 5
  • Theorem 6: li2015global
  • Lemma 7
  • Theorem 8
  • Remark 9
  • Remark 10
  • ...and 8 more