Differentiable Sparsity via $D$-Gating: Simple and Versatile Structured Penalization
Chris Kolb, Laetitia Frost, Bernd Bischl, David Rügamer
TL;DR
$D$-Gating is a differentiable overparameterization that splits each parameter group into a primary weight vector and multiple scalar gating factors, so that structured sparsity can be trained with standard SGD. The authors prove that, at balance, the $D$-Gating objective is equivalent to directly optimizing the non-differentiable $L_{2,2/D}$ group penalty, and show that the gating imbalance decays exponentially under gradient flow and geometrically under SGD. Across vision, language, and tabular tasks, the approach yields strong sparsity-accuracy tradeoffs without post-hoc pruning and with negligible overhead, and it is modular enough to apply to diverse architectures within standard training pipelines.
Abstract
Structured sparsity regularization offers a principled way to compact neural networks, but its non-differentiability breaks compatibility with conventional stochastic gradient descent and requires either specialized optimizers or additional post-hoc pruning without formal guarantees. In this work, we propose $D$-Gating, a fully differentiable structured overparameterization that splits each group of weights into a primary weight vector and multiple scalar gating factors. We prove that any local minimum under $D$-Gating is also a local minimum of the objective with the non-smooth structured $L_{2,2/D}$ penalty, and further show that the $D$-Gating objective converges at least exponentially fast to the $L_{2,2/D}$-regularized loss in the gradient flow limit. Together, our results show that $D$-Gating is theoretically equivalent to solving the original group sparsity problem, yet induces distinct learning dynamics that evolve from a non-sparse regime into sparse optimization. We validate our theory across vision, language, and tabular tasks, where $D$-Gating consistently delivers strong performance-sparsity tradeoffs and outperforms both direct optimization of structured penalties and conventional pruning baselines.
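
To make the two claims above concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes one group whose effective weight is $w = v \cdot \prod_d g_d$ with $D-1$ scalar gates, a plain squared-$L_2$ penalty on all factors with unit weight per factor, and a toy quadratic loss; the function names, the penalty scaling, and the toy problem are illustrative assumptions rather than details taken from the paper. The first part checks that a balanced factorization makes the smooth penalty equal to $D\,\|w\|_2^{2/D}$ (an AM-GM argument), and the second part runs plain gradient descent and tracks how the imbalance between $\|v\|_2^2$ and the squared gate shrinks.

```python
import numpy as np

def smooth_penalty(v, gates):
    # Differentiable surrogate: squared L2 norm of the primary vector and all gates.
    return float(v @ v + gates @ gates)

def group_penalty(w, D):
    # Non-smooth target on the effective weight: D * ||w||_2^(2/D).
    return float(D * np.linalg.norm(w) ** (2.0 / D))

def balanced_factors(w, D):
    # Factorization of w with ||v|| = |g_1| = ... = |g_{D-1}| = ||w||^(1/D).
    r = np.linalg.norm(w) ** (1.0 / D)
    return w / np.linalg.norm(w) * r, np.full(D - 1, r)

def gd_step(v, gates, target, lam, lr):
    # One gradient-descent step on the toy objective (an assumption):
    # 0.5 * ||w - target||^2 + lam * smooth_penalty(v, gates), with w = v * prod(gates).
    w = v * np.prod(gates)
    grad_w = w - target
    grad_v = np.prod(gates) * grad_w + 2.0 * lam * v
    grad_g = np.array([
        np.prod(np.delete(gates, d)) * (v @ grad_w) + 2.0 * lam * gates[d]
        for d in range(len(gates))
    ])
    return v - lr * grad_v, gates - lr * grad_g

def imbalance(v, gates):
    # Largest gap between ||v||^2 and any squared gate; zero at balance.
    return float(np.max(np.abs(v @ v - gates ** 2)))

# Static check with D = 3: the balanced factors attain the group penalty.
w = np.array([0.6, -0.8, 1.2])
v_bal, g_bal = balanced_factors(w, D=3)
balanced_gap = abs(smooth_penalty(v_bal, g_bal) - group_penalty(w, 3))

# Rescaling v by 2 and one gate by 1/2 leaves w unchanged but raises the smooth penalty.
v_unb, g_unb = 2.0 * v_bal, g_bal * np.array([0.5, 1.0])
unbalanced_excess = smooth_penalty(v_unb, g_unb) - group_penalty(w, 3)

# Dynamic check with D = 2: start far from balance and run gradient descent.
v, gates = np.array([2.0, 1.0]), np.array([0.3])
target = np.array([3.0, 0.0])
initial_imbalance = imbalance(v, gates)
for _ in range(3000):
    v, gates = gd_step(v, gates, target, lam=0.5, lr=0.01)
final_imbalance = imbalance(v, gates)
final_w = v * np.prod(gates)

print(balanced_gap, unbalanced_excess, initial_imbalance, final_imbalance, final_w)
```

In this sketch, with $D = 2$ the balanced penalty reduces to $2\lambda\|w\|_2$, so the toy problem coincides with group-lasso soft-thresholding of the target. Under gradient flow with this particular scaling, the cross terms in $\frac{d}{dt}\left(\|v\|_2^2 - g_d^2\right)$ cancel and leave $-4\lambda\left(\|v\|_2^2 - g_d^2\right)$, which is one way to see why the imbalance decays exponentially; the rate constant depends on the assumed penalty scaling.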
