Nuclear Norm Regularization for Deep Learning
Christopher Scarvelis, Justin Solomon
TL;DR
This paper addresses how to regularize neural networks toward locally low-rank Jacobians. The natural penalty, the nuclear norm of the Jacobian, is computationally prohibitive in high dimensions because it requires forming a large Jacobian and taking its singular value decomposition. The authors show that for functions parametrized as $f=g\circ h$, penalizing the Jacobian nuclear norm is equivalent to penalizing the average squared Frobenius norms of $Jg$ and $Jh$, which removes the need for SVDs. They further replace the Jacobian terms with a denoising-style estimator based on Hutchinson's trace estimator, so that no Jacobian is computed explicitly. The paper provides an equivalence theorem and a practical estimator, and demonstrates the approach on ROF denoising, unsupervised denoising on ImageNet, SVS-inspired denoising, and representation learning with a regularized autoencoder. The reported results indicate that the regularizer scales to high-dimensional problems and yields competitive denoising quality and meaningful latent representations.
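For intuition, the equivalence mirrors a standard matrix identity, stated here as background rather than as the paper's theorem. For any matrix $A$ with singular value decomposition $A = P\Sigma Q^\top$,

$$\|A\|_* \;=\; \min_{U,V \,:\, UV = A} \tfrac{1}{2}\left(\|U\|_F^2 + \|V\|_F^2\right),$$

with the minimum attained at $U = P\Sigma^{1/2}$ and $V = \Sigma^{1/2}Q^\top$, since then $\|U\|_F^2 = \|V\|_F^2 = \sum_i \sigma_i = \|A\|_*$. Reading $U$ and $V$ as the Jacobians $Jg$ and $Jh$ of a composition $f = g\circ h$ gives a plausible interpretation of the function-space result; the exact statement, constants, and conditions are those of the paper's theorem and are not reproduced here.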
Abstract
Penalizing the nuclear norm of a function's Jacobian encourages it to locally behave like a low-rank linear map. Such functions vary locally along only a handful of directions, making the Jacobian nuclear norm a natural regularizer for machine learning problems. However, this regularizer is intractable for high-dimensional problems, as it requires computing a large Jacobian matrix and taking its singular value decomposition. We show how to efficiently penalize the Jacobian nuclear norm using techniques tailor-made for deep learning. We prove that for functions parametrized as compositions $f = g \circ h$, one may equivalently penalize the average squared Frobenius norm of $Jg$ and $Jh$. We then propose a denoising-style approximation that avoids the Jacobian computations altogether. Our method is simple, efficient, and accurate, enabling Jacobian nuclear norm regularization to scale to high-dimensional deep learning problems. We complement our theory with an empirical study of our regularizer's performance and investigate applications to denoising and representation learning.
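The two ingredients named in the abstract can be illustrated numerically. The sketch below is not the authors' implementation: it uses NumPy, a linear test map, and hypothetical helper names (`nuclear_norm`, `balanced_factors`, `frobenius_sq_estimate`) chosen for illustration. It checks the matrix-level analogue of the factorization result, then estimates a squared Jacobian Frobenius norm from random perturbations, assuming the denoising-style approximation takes the common form $\mathbb{E}_\epsilon\|f(x+\sigma\epsilon)-f(x)\|^2/\sigma^2$ with Gaussian $\epsilon$.

```python
import numpy as np


def nuclear_norm(a):
    """Sum of the singular values of a matrix."""
    return np.linalg.svd(a, compute_uv=False).sum()


def balanced_factors(a):
    """Return u, v with u @ v == a, built from the SVD a = P diag(s) Q^T.

    u = P sqrt(diag(s)) and v = sqrt(diag(s)) Q^T, so each factor has
    squared Frobenius norm equal to the nuclear norm of a.
    """
    p, s, qt = np.linalg.svd(a, full_matrices=False)
    root = np.sqrt(s)
    return p * root, root[:, None] * qt


def frobenius_sq_estimate(f, x, sigma=1e-3, num_samples=20000, seed=0):
    """Monte Carlo estimate of ||Jf(x)||_F^2 without forming the Jacobian.

    Averages ||f(x + sigma * eps) - f(x)||^2 / sigma^2 over eps ~ N(0, I).
    For smooth f this matches ||Jf(x)||_F^2 up to higher-order terms in
    sigma; for linear f it is unbiased for any sigma.
    """
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((num_samples, x.shape[0]))
    fx = f(x)
    diffs = np.stack([f(x + sigma * e) - fx for e in eps])
    return (diffs ** 2).sum(axis=1).mean() / sigma ** 2


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    a = rng.standard_normal((5, 7))
    u, v = balanced_factors(a)
    print("nuclear norm:        ", nuclear_norm(a))
    print("half-sum of factors: ", 0.5 * ((u ** 2).sum() + (v ** 2).sum()))
    x = rng.standard_normal(7)
    print("exact ||J||_F^2:     ", (a ** 2).sum())
    print("perturbation estimate:", frobenius_sq_estimate(lambda z: a @ z, x))
```

In a training loop one would typically draw a single perturbation per example and let averaging over the batch play the role of the Monte Carlo mean; that choice is an assumption of this sketch, not something the abstract specifies.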
