Pay Attention to Small Weights
Chao Zhou, Tom Jacobs, Advait Gadhikar, Rebekka Burkholz
TL;DR
NanoAdam tackles the high memory and compute costs of finetuning large pretrained models by exploiting a consistent gradient–weight relationship observed during finetuning: large gradients tend to occur on small-magnitude weights. It introduces a gradient-free, per-layer bottom-$k$ masking strategy with a density scheduler that updates only small weights, which permits larger effective learning rates and reduces memory use. Theoretical analysis in a two-layer teacher–student model shows that updating small weights preserves the original representation and mitigates catastrophic forgetting. Empirical results on NLP (GLUE) and CV (CIFAR-10, Flowers102) show improved generalization and smaller parameter drift compared with baselines. The approach scales to large models and offers memory-efficient continual learning benefits across NLP and vision tasks.
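The per-layer bottom-$k$ selection described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `bottom_k_mask` and `linear_density`, the linear form of the schedule, and the example density values are all assumptions made for illustration. The summary only says that a density scheduler exists, not what shape it takes.

```python
import numpy as np


def bottom_k_mask(weights: np.ndarray, density: float) -> np.ndarray:
    """Boolean mask marking the `density` fraction of smallest-|w| entries.

    The selection uses weight magnitudes only, so no gradient is needed.
    """
    if not 0.0 < density <= 1.0:
        raise ValueError("density must be in (0, 1]")
    k = max(1, int(round(density * weights.size)))
    flat = np.abs(weights).ravel()
    # argpartition puts the indices of the k smallest magnitudes first.
    idx = np.argpartition(flat, k - 1)[:k]
    mask = np.zeros(weights.size, dtype=bool)
    mask[idx] = True
    return mask.reshape(weights.shape)


def linear_density(step: int, total_steps: int, start: float, end: float) -> float:
    """Hypothetical schedule: interpolate the density linearly over training."""
    t = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start + t * (end - start)


# One layer's weights; half of the entries are selected for updating.
w = np.array([[0.9, -0.01, 0.3], [-0.002, 1.5, 0.05]])
mask = bottom_k_mask(w, density=0.5)
# Only the three smallest-magnitude entries (-0.01, -0.002, 0.05) are marked.
```

Applying the mask separately to each layer, rather than to all parameters at once, ensures that every layer keeps a share of trainable weights regardless of how weight scales differ across layers.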
Abstract
Finetuning large pretrained neural networks is resource-intensive in both memory and computational cost. To mitigate this, a common approach is to restrict training to a subset of the model parameters. By analyzing the relationship between gradients and weights during finetuning, we observe a notable pattern: large gradients are often associated with small-magnitude weights. This correlation is more pronounced in finetuning settings than in training from scratch. Motivated by this observation, we propose NanoAdam, which dynamically updates only the small-magnitude weights during finetuning and offers several practical advantages. First, the criterion is gradient-free: the parameter subset can be determined without gradient computation. Second, it preserves large-magnitude weights, which are likely to encode critical features learned during pretraining, thereby reducing the risk of catastrophic forgetting. Third, it permits the use of larger learning rates and consistently leads to better generalization performance in experiments. We demonstrate this for both NLP and vision tasks.
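One simple way to probe the observed pattern, that large gradients fall on small-magnitude weights, is to measure how much the set of smallest weights overlaps with the set of largest gradients in a layer. The sketch below is an illustrative diagnostic, not the analysis used in the paper. The function name `small_weight_gradient_overlap` and the toy values are assumptions.

```python
import numpy as np


def small_weight_gradient_overlap(
    weights: np.ndarray, grads: np.ndarray, density: float
) -> float:
    """Fraction of the k largest-|g| entries that are also among the k smallest-|w|.

    A value near 1 means large gradients concentrate on small weights;
    k is the `density` fraction of the layer's entries.
    """
    if weights.shape != grads.shape:
        raise ValueError("weights and grads must have the same shape")
    k = max(1, int(round(density * weights.size)))
    w = np.abs(weights).ravel()
    g = np.abs(grads).ravel()
    small_w = set(np.argpartition(w, k - 1)[:k].tolist())   # k smallest weights
    large_g = set(np.argpartition(-g, k - 1)[:k].tolist())  # k largest gradients
    return len(small_w & large_g) / k


# Toy layer: the two largest gradients sit on the two smallest weights.
w = np.array([0.9, 0.01, 0.3, 0.002, 1.5, 0.05])
g = np.array([0.1, 2.0, 0.2, 3.0, 0.05, 0.01])
overlap = small_weight_gradient_overlap(w, g, density=0.5)
```

For uncorrelated weights and gradients, the expected overlap is roughly equal to the density itself, so values well above that baseline would indicate the kind of association the abstract describes.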
