Monge SAM: Robust Reparameterization-Invariant Sharpness-Aware Minimization Based on Loss Geometry
Albert Kjøller Jacobsen, Georgios Arvanitidis
TL;DR
Monge SAM (M-SAM) is a reparameterization-invariant variant of sharpness-aware minimization (SAM). It defines a Monge metric $\mathbf{G}(\boldsymbol{\theta}) = \mathbf{I}_K + \nabla \ell(\boldsymbol{\theta}) \nabla \ell(\boldsymbol{\theta})^\top$ on the parameter space and derives a closed-form worst-case perturbation $\boldsymbol{\delta}_{\text{M-SAM}}^* = \frac{1}{\sqrt{1+\|\nabla \ell(\boldsymbol{\theta})\|_2^2}} \cdot \big(\frac{\rho}{\|\nabla \ell(\boldsymbol{\theta})\|_2} \nabla \ell(\boldsymbol{\theta})\big)$. Because the leading factor shrinks the SAM perturbation as the gradient norm grows, M-SAM interpolates between SGD and SAM, yielding robustness to hyperparameters and reduced saddle-point attraction while remaining applicable to any modeling choice. The authors provide theoretical stability analyses and empirical demonstrations on toy 2D losses, CIFAR-10 fine-tuning, and cross-modal CLIP alignment, showing improved stability and representational alignment over SAM in several settings. Overall, M-SAM is a geometry-aware alternative to SAM with practical benefits for generalization and robustness.
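To make the closed form concrete, the following is a minimal sketch in plain Python that evaluates the SAM and M-SAM perturbations for a given gradient vector. The function names (`sam_perturbation`, `msam_perturbation`, `monge_sq_norm`) are hypothetical and not taken from the authors' code, and returning a zero perturbation for a zero gradient is an assumption made here to avoid division by zero.

```python
import math


def sam_perturbation(grad, rho):
    """Standard SAM perturbation: rho * g / ||g||_2.

    Assumption of this sketch: a zero gradient yields a zero perturbation.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0 for _ in grad]
    return [rho * g / norm for g in grad]


def msam_perturbation(grad, rho):
    """M-SAM perturbation: the SAM perturbation scaled by 1 / sqrt(1 + ||g||_2^2)."""
    sq_norm = sum(g * g for g in grad)
    scale = 1.0 / math.sqrt(1.0 + sq_norm)
    return [scale * d for d in sam_perturbation(grad, rho)]


def monge_sq_norm(delta, grad):
    """delta^T (I + g g^T) delta, computed without forming the K x K matrix."""
    inner = sum(d * g for d, g in zip(delta, grad))
    return sum(d * d for d in delta) + inner * inner


# Illustrative values only (not from the paper).
grad = [3.0, 4.0]
rho = 0.5
delta = msam_perturbation(grad, rho)
```

For any nonzero gradient, `monge_sq_norm(delta, grad)` equals `rho ** 2` up to floating-point error, so the perturbation sits on the boundary of the radius-$\rho$ ball measured in the Monge metric, while its Euclidean norm is $\rho / \sqrt{1+\|\nabla \ell(\boldsymbol{\theta})\|_2^2}$.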
Abstract
Recent studies on deep neural networks show that flat minima of the loss landscape correlate with improved generalization. Sharpness-aware minimization (SAM) efficiently finds flat regions by updating the parameters according to the gradient at an adversarial perturbation. The perturbation depends on the Euclidean metric, making SAM non-invariant under reparameterizations, which blurs the relationship between sharpness and generalization. We propose Monge SAM (M-SAM), a reparameterization-invariant version of SAM obtained by considering a Riemannian metric in the parameter space induced naturally by the loss surface. Compared to previous approaches, M-SAM works under any modeling choice and relies only on mild assumptions, while being as computationally efficient as SAM. We theoretically argue that M-SAM varies between SAM and gradient descent (GD), which increases robustness to hyperparameter selection and reduces attraction to suboptimal equilibria like saddle points. We demonstrate this behavior both theoretically and empirically on a multi-modal representation alignment task.
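The claim that M-SAM varies between SAM and GD can be read off the closed-form perturbation given in the TL;DR. The short worked limit below is a sketch that follows from that formula alone; it is not a reproduction of the paper's stability analysis.

```latex
\begin{align*}
\big\|\boldsymbol{\delta}_{\text{M-SAM}}^*\big\|_2
  &= \frac{\rho}{\sqrt{1+\|\nabla \ell(\boldsymbol{\theta})\|_2^2}}, \\
\|\nabla \ell(\boldsymbol{\theta})\|_2 \to 0
  &\;\Longrightarrow\; \big\|\boldsymbol{\delta}_{\text{M-SAM}}^*\big\|_2 \to \rho
  \quad \text{(the SAM perturbation is recovered)}, \\
\|\nabla \ell(\boldsymbol{\theta})\|_2 \to \infty
  &\;\Longrightarrow\; \big\|\boldsymbol{\delta}_{\text{M-SAM}}^*\big\|_2 \to 0
  \quad \text{(the update approaches a plain GD step)}.
\end{align*}
```

In words: where gradients are small, M-SAM perturbs almost as much as SAM, and where gradients are large the perturbation vanishes and the gradient is evaluated essentially at the current parameters.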
