Hyperbolic Aware Minimization: Implicit Bias for Sparsity

Tom Jacobs, Advait Gadhikar, Celia Rubio-Madrigal, Rebekka Burkholz

TL;DR

This work characterizes the implicit bias of Hyperbolic Aware Minimization (HAM) in the context of underdetermined linear regression. The characterization provides insight into the mechanism by which HAM consistently increases performance, even in the case of dense training, as demonstrated in experiments on standard vision benchmarks.

Abstract

Understanding the implicit bias of optimization algorithms is key to explaining and improving the generalization of deep models. The hyperbolic implicit bias induced by pointwise overparameterization promotes sparsity, but also yields a small inverse Riemannian metric near zero, slowing down parameter movement and impeding meaningful parameter sign flips. To overcome this obstacle, we propose Hyperbolic Aware Minimization (HAM), which alternates a standard optimizer step with a lightweight hyperbolic mirror step. The mirror step incurs less compute and memory than pointwise overparameterization, reproduces its beneficial hyperbolic geometry for feature learning, and mitigates the small-inverse-metric bottleneck. Our characterization of the implicit bias in the context of underdetermined linear regression provides insight into the mechanism by which HAM consistently increases performance, even in the case of dense training, as we demonstrate in experiments on standard vision benchmarks. HAM is especially effective in combination with different sparsification methods, advancing the state of the art.
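
The small-inverse-metric bottleneck of pointwise overparameterization mentioned in the abstract can be illustrated with a minimal NumPy sketch. It does not implement HAM or its mirror step. It only takes one gradient descent step on the factors of $\bm\theta = \bm m \odot \bm w$ and compares the result with a plain step on $\bm\theta$ rescaled by $\bm m^2 + \bm w^2$. The toy quadratic loss, the target values, the learning rate, and the helper names `loss_grad`, `overparam_step`, and `inverse_metric` are assumptions made for illustration and do not come from the paper.

```python
import numpy as np

def loss_grad(theta, target):
    """Gradient of the toy loss 0.5 * (theta - target)**2 with respect to theta."""
    return theta - target

def overparam_step(m, w, target, lr):
    """One gradient descent step on (m, w) for theta = m * w (elementwise)."""
    g = loss_grad(m * w, target)
    # Chain rule: dL/dm = w * g and dL/dw = m * g.
    return m - lr * w * g, w - lr * m * g

def inverse_metric(m, w):
    """Per-coordinate scaling that the factorization applies to a step on theta."""
    return m**2 + w**2

# Hypothetical parameters and targets; the last coordinate sits very close to zero
# and has the wrong sign relative to its target.
theta0 = np.array([1.0, 1e-2, -1e-4])
target = np.array([-1.0, -1.0, 1.0])

# Balanced factorization with m0 * w0 == theta0 and |m0| == |w0|.
m0 = np.sign(theta0) * np.sqrt(np.abs(theta0))
w0 = np.sqrt(np.abs(theta0))

lr = 1e-2
g0 = loss_grad(theta0, target)
m1, w1 = overparam_step(m0, w0, target, lr)

# First-order prediction: a step on theta scaled by m^2 + w^2 (here 2 * |theta0|).
first_order = theta0 - lr * inverse_metric(m0, w0) * g0
# Expanding (m - lr*w*g) * (w - lr*m*g) leaves the term lr^2 * theta * g^2.
residual = m1 * w1 - first_order

plain_gd = theta0 - lr * g0  # unscaled gradient descent step for comparison

print("inverse metric:       ", inverse_metric(m0, w0))
print("overparameterized step:", m1 * w1)
print("plain GD step:         ", plain_gd)
print("second-order residual: ", residual)
```

In this toy setting the coordinate closest to zero barely moves under the factorized step and keeps its sign, while the plain gradient descent step flips it. This is the kind of slowdown near zero that the abstract attributes to the small inverse metric.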

Paper Structure

This paper contains 51 sections, 18 theorems, 62 equations, 12 figures, 20 tables, 1 algorithm.

Key Result

Theorem 3.1

If $m_0 =\mathop{\mathrm{sign}}\nolimits(\bm{\theta}_0) w_0 = \sqrt{|\bm{\theta}_0|}$, then is equivalent to Eq. (hyperbolic pilot eq) up to first order, i.e., the discretization error is $\mathcal{O}(\eta^2)$.

Figures (12)

  • Figure 1: The inverse metric $g^{-1}(\bm \theta)$ of HAM lies above that of gradient descent (GD), while that of the overparameterization $\bm m\odot \bm w$ lies below it for small $\bm \gamma$. This enables moving from the initialization $\bm \theta_0$ to the optimum $\bm \theta^*$ instead of getting stuck. HAM therefore fixes the vanishing inverse metric. Note the hyperbolic geometric structure of HAM and $\bm m \odot \bm w$ compared to the flatness of GD.
  • Figure 2: Demonstration of HAM's mechanisms. (a) The percentage of sign flips during training for Random PaI with sparsity level $90\%$ trained for $100$ epochs, where each interval corresponds to ten epochs. HAM consistently performs more sign flips than both the baseline and Sign-In. (b) Plot of the normalized Bregman function $R_{\alpha}$, where increasing $\alpha$ leads to an $L_1$ shape.
  • Figure 3: The difference between using $\mathop{\mathrm{sign}}\nolimits(\bm{\theta}_{k+\frac{1}{2}})$ and $\mathop{\mathrm{sign}}\nolimits (\bm{\theta}_k)$ in Eq. (secondstep). The main change incurred by using the new sign is that a parameter accelerates away from zero when a sign flip occurs. Thus, when parameters are small, we can be more certain that they are actually redundant. Furthermore, when sign flips become less frequent due to the decreasing learning rate at the end of training, we obtain the same implicit bias regardless, as shown in Theorem (sign stability).
  • Figure 4: Figure 2 from gadhikar2025signinlotteryreparameterizingsparse, showing the sign-flipping benefits achieved with pointwise overparameterization $m \odot w$ in the sparse and dense cases on a single-hidden-neuron model.
  • Figure 5: The average inverse metric at zero of the first layer of a ResNet50, reported at every tenth epoch.
  • ...and 7 more figures

Theorems & Definitions (26)

  • Theorem 3.1
  • Remark 3.2
  • Theorem 4.1
  • Theorem 4.2
  • Theorem 4.3
  • Lemma 4.4
  • Theorem 4.5
  • Theorem 4.6
  • Remark 4.7
  • Remark 4.8
  • ...and 16 more