Differentiable Sparsity via $D$-Gating: Simple and Versatile Structured Penalization

Chris Kolb, Laetitia Frost, Bernd Bischl, David Rügamer

TL;DR

D-Gating introduces a differentiable overparameterization that splits each parameter group into a primary weight and multiple gating factors, enabling SGD-compatible optimization while achieving structured sparsity. The authors prove that, at balance, the D-Gating objective is equivalent to directly optimizing the non-differentiable $L_{2,2/D}$ group penalty, and show that the gating imbalance decays exponentially under gradient flow and geometrically under SGD. The approach yields strong sparsity-accuracy tradeoffs across vision, language, and tabular tasks without post-hoc pruning and with negligible overhead, highlighting its modularity and practicality for diverse architectures. This work advances principled, differentiable structured sparsity that integrates smoothly with standard training pipelines and provides theoretical guarantees alongside broad empirical validation. Overall, D-Gating offers a versatile, theoretically sound pathway to scalable structured sparsity in modern deep learning models.

Abstract

Structured sparsity regularization offers a principled way to compact neural networks, but its non-differentiability breaks compatibility with conventional stochastic gradient descent and requires either specialized optimizers or additional post-hoc pruning without formal guarantees. In this work, we propose $D$-Gating, a fully differentiable structured overparameterization that splits each group of weights into a primary weight vector and multiple scalar gating factors. We prove that any local minimum under $D$-Gating is also a local minimum using non-smooth structured $L_{2,2/D}$ penalization, and further show that the $D$-Gating objective converges at least exponentially fast to the $L_{2,2/D}$-regularized loss in the gradient flow limit. Together, our results show that $D$-Gating is theoretically equivalent to solving the original group sparsity problem, yet induces distinct learning dynamics that evolve from a non-sparse regime into sparse optimization. We validate our theory across vision, language, and tabular tasks, where $D$-Gating consistently delivers strong performance-sparsity tradeoffs and outperforms both direct optimization of structured penalties and conventional pruning baselines.
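The reparameterization described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names (`collapse`, `l2_penalty`, `group_penalty`, `balance`) are hypothetical, and the penalty is assumed to be the plain sum of squares over all gating parameters. Under that assumption, a balanced factorization (all factors of a group having equal magnitude) makes the smooth $L_2$ penalty equal to $D$ times the $L_{2,2/D}$ group penalty of the collapsed weights.

```python
import numpy as np

def collapse(omega, gamma):
    """Collapse gating parameters into effective weights w_j = omega_j * prod_d gamma_{j,d}.

    omega: (J, p) array, one primary weight vector per group.
    gamma: (J, D-1) array, scalar gating factors per group.
    """
    return omega * np.prod(gamma, axis=1, keepdims=True)

def l2_penalty(omega, gamma):
    """Smooth penalty on the gating parameters (assumed: plain sum of squares)."""
    return np.sum(omega ** 2) + np.sum(gamma ** 2)

def group_penalty(w, D):
    """Non-smooth L_{2,2/D} penalty: sum over groups of ||w_j||_2^(2/D)."""
    return np.sum(np.linalg.norm(w, axis=1) ** (2.0 / D))

def balance(w, D):
    """Factor w into balanced gating parameters with ||omega_j|| = |gamma_{j,d}|."""
    norms = np.linalg.norm(w, axis=1, keepdims=True)   # shape (J, 1)
    safe = np.where(norms > 0, norms, 1.0)             # avoid 0/0 for zero groups
    omega = w / safe ** ((D - 1) / D)                  # ||omega_j|| = ||w_j||^(1/D)
    gamma = np.repeat(norms ** (1.0 / D), D - 1, axis=1)
    return omega, gamma
```

Rescaling one factor up and another down leaves the collapsed weights unchanged but increases the smooth penalty, which is why weight decay on the gating parameters pushes them toward balance.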

Paper Structure

This paper contains 58 sections, 6 theorems, 71 equations, 22 figures, 5 tables, 1 algorithm.

Key Result

Lemma 1

Let $(\bm{\omega},\bm{\Gamma})$ be $D$-Gating parameters satisfying $\mathbf{w}_j = \bm{\omega}_j\,\prod_{d=1}^{D-1}\gamma_{j,d} \quad \text{for } j \in [J]$. If $(\bm{\omega},\bm{\Gamma})$ is a stationary point of the $L_2$-regularized objective $\mathcal{L}(\bm{\omega},\bm{\Gamma})$ with $\lambda>

Figures (22)

  • Figure 1: Parameter trajectories for a two-feature squared loss toy objective with non-convex $L_{2,2/D}$ regularization $\mathcal{L}(\mathbf{w})=(y-x_1\mathrm{w}_1-x_2\mathrm{w}_2)^2+\lambda \Vert \mathbf{w} \Vert_2^{2/D}$ whose global minimizer is $(\mathrm{w}_1^{\ast},\mathrm{w}_2^{\ast})=(0,0)$. Left: Failure of direct gradient descent (GD) optimization to converge to $\bm{0}$ because of the non-differentiability at the origin. Right: $D$-gated objective where $\mathbf{w}=\bm{\omega} \cdot \prod_{d=1}^{D-1}\gamma_{d}$, converging smoothly to $\bm{0}$.
  • Figure 2: Overview of the differentiable $D$-Gating method for structured sparsity (cf. Algorithm 1). For simplicity, we show $D$-Gating visually for a single fully-connected layer with input-wise grouping (colors), but our approach extends to arbitrary network components such as convolutional filters or attention heads. We proceed by applying $D$-Gating (red nodes and their connections) to the neural network weights and running SGD on the gating parameters with weight decay. After training, the weights are collapsed again and the zero structures removed, with the resulting sparse minimizers also being minimizers of the non-smooth $L_{2,2/D}$-regularized objective.
  • Figure 3: Evolution of imbalance during SGD of a neuron-wise $D$-gated LeNet-300-100 for $D\in\{2,3,4\}$ (left to right). As predicted by our theory, the losses converge exponentially, with the rate increasing in $\lambda$ and decreasing in $D$.
  • Figure 4: Regularization paths for a sparse linear regression task using $D$-Gating. Left: Test RMSE vs $\lambda$. The curves for $2$-Gating and group lasso coincide beyond a certain $\lambda$, but are outperformed by $D$-Gating with $D>2$. Middle: Group sparsity of $2$-Gating coincides with the group lasso solution. Dashed grey line indicates optimal $\lambda$ for all models. Deeper gating yields sparser solutions. Right: Transition of $2$-Gating to the group lasso solution beyond a certain $\lambda$ coincides with zero imbalance attained after training. Direct optimization with GD yields notably different regularization paths far from the global minima. The dashed black line indicates $0$ misalignment at the end of training. Means and $95\%$ confidence intervals over ten simulations are shown for the left two plots.
  • Figure 5: Comparison of feature selection methods. Means and std. over $5$ random initializations are reported.
  • ...and 17 more figures
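The right panel of Figure 1 can be imitated with plain gradient descent on a $D$-gated version of the two-feature toy objective. The sketch below is illustrative only: the data `x`, `y`, the strength `lam`, the step size `lr`, the initialization, and the function name `run_gated_gd` are all assumptions rather than the paper's settings, and the smooth penalty is assumed to be `lam` times the sum of squared gating parameters. The values are chosen so that $\mathbf{w}=\bm{0}$ is the minimizer of the corresponding non-smooth problem.

```python
import numpy as np

def run_gated_gd(D=2, lam=2.0, lr=0.05, steps=500):
    """Gradient descent on (y - x.w)^2 + lam * (||omega||^2 + sum_d gamma_d^2),
    with w = omega * prod_d gamma_d. Returns the collapsed weights w."""
    x = np.array([1.0, 0.5])      # assumed toy features
    y = 1.0                       # assumed toy target
    omega = np.array([1.0, 1.0])  # primary weight vector
    gamma = np.ones(D - 1)        # D-1 scalar gating factors
    for _ in range(steps):
        g = np.prod(gamma)
        residual = y - g * (x @ omega)
        # Gradient w.r.t. omega: data term plus weight decay.
        grad_omega = -2.0 * residual * g * x + 2.0 * lam * omega
        # Gradient w.r.t. each gamma_d: d(prod)/d(gamma_d) is the product of the others.
        grad_gamma = np.empty_like(gamma)
        for d in range(D - 1):
            others = np.prod(np.delete(gamma, d))
            grad_gamma[d] = -2.0 * residual * (x @ omega) * others + 2.0 * lam * gamma[d]
        omega = omega - lr * grad_omega
        gamma = gamma - lr * grad_gamma
    return omega * np.prod(gamma)

w_final = run_gated_gd(D=3)
print(np.linalg.norm(w_final))
```

Because the gated objective is smooth everywhere, including at the origin, ordinary gradient steps are well defined along the whole trajectory, in contrast to direct descent on $\Vert \mathbf{w} \Vert_2^{2/D}$.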

Theorems & Definitions (13)

  • Definition 1: $D$-Gating
  • Lemma 1: Balancedness at stationary points
  • Corollary 1: Loss simplification at balanced gating parameters
  • Theorem 1: Equivalence of $D$-Gating and $L_{2,2/D}$ regularization
  • Lemma 2: Exponential decay of imbalance under continuous-time GF
  • Lemma 3: Convergence of $D$‑gated loss to $L_{2,2/D}$ regularized loss under GF
  • Lemma 4: Imbalance evolution under discrete-time GD
  • proof
  • proof
  • proof
  • ...and 3 more
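
The link between balancedness (Lemma 1) and the $L_{2,2/D}$ penalty (Corollary 1, Theorem 1) can be motivated by the arithmetic-geometric mean inequality. The following is a sketch for a single group, assuming the smooth penalty is the plain sum of squares of the gating parameters; the constants in the paper's statements may be scaled differently. With $\mathbf{w} = \bm{\omega}\prod_{d=1}^{D-1}\gamma_{d}$, so that $\Vert\mathbf{w}\Vert_2 = \Vert\bm{\omega}\Vert_2\prod_{d=1}^{D-1}\vert\gamma_{d}\vert$:

$$
\Vert\bm{\omega}\Vert_2^2 + \sum_{d=1}^{D-1}\gamma_{d}^2 \;\ge\; D\Big(\Vert\bm{\omega}\Vert_2^2\prod_{d=1}^{D-1}\gamma_{d}^2\Big)^{1/D} \;=\; D\,\Vert\mathbf{w}\Vert_2^{2/D},
$$

with equality if and only if $\Vert\bm{\omega}\Vert_2^2 = \gamma_{1}^2 = \dots = \gamma_{D-1}^2$. Under this assumption, among all factorizations of a fixed $\mathbf{w}$, the balanced ones attain the smallest smooth penalty, and that smallest value is proportional to the non-smooth $L_{2,2/D}$ penalty of $\mathbf{w}$.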