Smoothing the Edges: Smooth Optimization for Sparse Regularization using Hadamard Overparametrization

Chris Kolb, Christian L. Müller, Bernd Bischl, David Rügamer

TL;DR

This work introduces a general, differentiable optimization transfer for explicit sparse regularization by overparametrizing targeted parameter subsets through Hadamard-based mappings. By constructing smooth surrogate penalties via a smooth variational form (SVF) and suitable parametrizations (including Hadamard products, differences, and powers), the authors prove equivalence of global and local minima between the original non-smooth problem and the smooth surrogate, thereby enabling standard gradient-based optimization without bespoke solvers. They systematically develop depth-$k$ and group-structured parametrizations, extend to non-integer depths with Hadamard powers, and address practical considerations like parameter sharing and initialization. Numerical experiments across high-dimensional regression, DNN pruning, and structured CNN sparsity demonstrate that the smooth surrogates reproduce or outperform traditional non-smooth regularizers while remaining compatible with SGD. The framework offers a versatile toolkit for integrating sparse regularization into differentiable models with broad applicability and theoretical guarantees on the preservation of minimizers.

Abstract

We present a framework for smooth optimization of explicitly regularized objectives for (structured) sparsity. These non-smooth and possibly non-convex problems typically rely on solvers tailored to specific models and regularizers. In contrast, our method enables fully differentiable and approximation-free optimization and is thus compatible with the ubiquitous gradient descent paradigm in deep learning. The proposed optimization transfer comprises an overparameterization of selected parameters and a change of penalties. In the overparametrized problem, smooth surrogate regularization induces non-smooth, sparse regularization in the base parametrization. We prove that the surrogate objective is equivalent in the sense that it not only has identical global minima but also matching local minima, thereby avoiding the introduction of spurious solutions. Additionally, our theory establishes results of independent interest regarding matching local minima for arbitrary, potentially unregularized, objectives. We comprehensively review sparsity-inducing parametrizations across different fields that are covered by our general theory, extend their scope, and propose improvements in several aspects. Numerical experiments further demonstrate the correctness and effectiveness of our approach on several sparse learning problems ranging from high-dimensional regression to sparse neural network training.
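The step from a smooth surrogate penalty to a non-smooth penalty in the base parametrization can be made concrete in the simplest scalar case, the Hadamard product $\beta=uv$ with a squared $\ell_2$ surrogate. The following is only a sketch: the loss symbol $\mathcal{L}$ and the regularization strength $\lambda$ are notation assumed here for illustration and are not taken from the text above.

```latex
% Scalar case: minimizing the smooth penalty over the fiber {(u,v) : uv = \beta}.
\begin{align*}
u^2 + v^2 - 2|uv| &= \bigl(|u| - |v|\bigr)^2 \;\ge\; 0,
  \quad\text{with equality iff } |u| = |v| = \sqrt{|\beta|}, \\
\min_{(u,v)\,:\,uv=\beta} \bigl(u^2 + v^2\bigr) &= 2|\beta|, \\
\min_{u,v}\; \mathcal{L}(uv) + \lambda\bigl(u^2 + v^2\bigr)
  &= \min_{\beta}\; \mathcal{L}(\beta) + 2\lambda|\beta|.
\end{align*}
```

The last line only states that the optimal values agree; the correspondence of local minima is the part that the abstract attributes to the paper's theory.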

Paper Structure (61 sections, 34 theorems, 84 equations, 17 figures, 3 tables)

This paper contains 61 sections, 34 theorems, 84 equations, 17 figures, 3 tables.

Key Result

Lemma 2.4

If $(\hat{\bm{\psi}},\hat{\bm{\beta}})$ is a local minimizer of $\mathcal{P}(\bm{\psi},\bm{\beta})$, and $\mathcal{K}(\bm{\xi})$ is a continuous surjection, then all $(\hat{\bm{\psi}},\hat{\bm{\xi}})$ such that $\hat{\bm{\xi}}\in\mathcal{K}^{-1}(\hat{\bm{\beta}})$ are local minimizers of $\mathcal{P

Figures (17)

  • Figure 1: Illustration of smooth optimization transfer. Left: univariate lasso problem $\mathcal{P}(\beta)=(1-\frac{3}{2} \beta)^2+2|\beta|$ (red line indicates the global minimizer $\hat{\beta}$). Middle: contours of the equivalent smooth surrogate $\mathcal{Q}(u,v)=(1-\frac{3}{2} uv)^2+u^2+v^2$ using a Hadamard product parametrization (\ref{eq:hpp-def}) with $\mathcal{K}(u,v)=uv=\beta$. Both global minimizers (dots) map to $\mathcal{K}(\hat{u},\hat{v})=\hat{\beta}$. Right: non-convex surface of higher-dimensional $\mathcal{Q}(u,v)$.
  • Figure 2: Relationship between local minimizer $\hat{\bm{\xi}}$ of $\mathcal{Q}$, the induced minimizer $\mathcal{K}(\hat{\bm{\xi}})=\hat{\bm{\beta}}$ of $\mathcal{P}$, and the cont. solution mapping $\hat{\bm{\xi}}(\bm{\beta})$ of the SVF. Left: solid curves show two fibers $\mathcal{K}^{-1}(\hat{\bm{\beta}})$ (red) and $\mathcal{K}^{-1}(\tilde{\bm{\beta}})$ (blue). The solution map $\hat{\bm{\xi}}(\bm{\beta})$ (dashed green) maps to minimizers of the SVF for varying $\bm{\beta}$, where $\mathcal{Q}$ equals $\mathcal{P}$. Right: concrete example showing scalar parametrization $\beta_j=\mathcal{K}(u_j,v_j)=u_j v_j$, and surrogate $\ell_2$ regularization $\mathcal{R}_{\bm{\xi}}(u_j,v_j)=u_j^2+v_j^2$. Each branch of $\mathcal{K}^{-1}(\beta_j)$ has a unique minimal-norm point (vertices). The $\ell_2$ penalty there is $2|u_j v_j|=2|\beta_j|$, inducing $\ell_1$ regularization.
  • Figure 3: Diagonal linear networks corresponding to different parametrizations of a linear predictor: a) HPP ($\ell_1$), b) HDP ($\ell_1$), c) Network corresponding to a structure-inducing parametrization (GHPP for $\ell_{2,1}$, cf. \ref{sec:group-lasso-vanilla}) with grouping layer. The left nodes are inputs and the right-most node is the output.
  • Figure 4: a) Illustration of $\ell_1$ optimization transfer using HPP and surrogate $\ell_2$ regularization on a scalar $\beta_j=10$ (lower plane). The hyperbolic paraboloid (blue/green) shows the parametrization $\mathcal{K}(u_j,v_j)=u_jv_j$ and the elliptic paraboloid (orange) the $\ell_2$ surrogate. The fiber $\mathcal{K}^{-1}(10)$ defines a hyperbola (black), whose two vertices achieve the minimal $\ell_2$ penalty of $2|10|=20$ (upper plane) over $\mathcal{K}^{-1}(10)$. b) Majorization of overparametrized $\ell_1$ term $2|u_jv_j|$ (blue/green) through $\ell_2$ penalty. The $\ell_2$ (orange) is tightly "hugged" by the $\ell_1$ term. The difference of both regularizers attains zero at perpendicular lines intersecting at the origin, illustrating the u.h.c. of the SVF solution map.
  • Figure 5: Deep diagonal linear networks corresponding to different parametrizations of a linear predictor set-up. a) HPP (for $\ell_{2/k}$), b) $\text{GHPP}_k$ (for $\ell_{2,2/k}$), c) $\text{GHPP}_{k_1,k_1+k_2}$ (for $\ell_{2/k_1,2/(k_1+k_2)}$). The depth up to and including the grouping layer is $k_1$, followed by $k_2=k-k_1$ more diagonal layers. Nodes on the left represent input features and the single node on the right the output.
  • ...and 12 more figures
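
The univariate example in the Figure 1 caption can be checked numerically. The sketch below runs plain gradient descent on the smooth surrogate $\mathcal{Q}(u,v)$ and compares $uv$ with the lasso minimizer obtained from the stationarity condition of $\mathcal{P}$ on $\beta>0$. The function names, the initialization $(1.0, 0.5)$, the step size, and the iteration count are assumptions chosen for illustration; they are not taken from the paper.

```python
# Minimal sketch of the Figure 1 example (illustrative choices, not the authors' code).

def lasso_objective(beta):
    """Non-smooth objective P(beta) = (1 - 1.5*beta)^2 + 2*|beta|."""
    return (1.0 - 1.5 * beta) ** 2 + 2.0 * abs(beta)


def surrogate_objective(u, v):
    """Smooth surrogate Q(u, v) = (1 - 1.5*u*v)^2 + u^2 + v^2."""
    return (1.0 - 1.5 * u * v) ** 2 + u ** 2 + v ** 2


def surrogate_grad(u, v):
    """Analytic gradient of Q with respect to (u, v)."""
    residual = 1.0 - 1.5 * u * v
    grad_u = -3.0 * v * residual + 2.0 * u
    grad_v = -3.0 * u * residual + 2.0 * v
    return grad_u, grad_v


def minimize_surrogate(u=1.0, v=0.5, lr=0.05, steps=2000):
    """Plain gradient descent on Q; init, lr and steps are assumed values."""
    for _ in range(steps):
        grad_u, grad_v = surrogate_grad(u, v)
        u, v = u - lr * grad_u, v - lr * grad_v
    return u, v


def lasso_minimizer():
    """Minimizer of P: on beta > 0, -3*(1 - 1.5*beta) + 2 = 0 gives beta = 1/4.5."""
    return max(3.0 - 2.0, 0.0) / 4.5


u_hat, v_hat = minimize_surrogate()
beta_from_surrogate = u_hat * v_hat   # map back via K(u, v) = u * v
beta_direct = lasso_minimizer()

print("beta from surrogate:", beta_from_surrogate)
print("beta from lasso:    ", beta_direct)
print("Q at (u_hat, v_hat):", surrogate_objective(u_hat, v_hat))
print("P at beta_direct:   ", lasso_objective(beta_direct))
```

With this initialization the iterates settle on the branch with $u=v>0$; starting with opposite signs would be expected to reach the mirrored minimizer, which maps to the same $\hat{\beta}$.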

Theorems & Definitions (42)

  • Definition 2.1: Smooth optimization transfer
  • Definition 2.2: Equivalence of optimization problems
  • Definition 2.3: Local openness
  • Lemma 2.4
  • Lemma 2.5
  • Definition 2.6: Smooth variational form
  • Definition 2.7: Upper hemicontinuity
  • Lemma 2.8
  • Lemma 2.9
  • Theorem 2.10: Smooth optimization transfer for sparse regularization
  • ...and 32 more