MoFO: Momentum-Filtered Optimizer for Mitigating Forgetting in LLM Fine-Tuning
Yupeng Chen, Senmiao Wang, Yushun Zhang, Zhihang Lin, Haozhe Zhang, Weijian Sun, Tian Ding, Ruoyu Sun
TL;DR
This work tackles forgetting during large language model fine-tuning when pretraining data may be unavailable. It introduces MoFO, a momentum-filtered optimizer that combines Adam with greedy block coordinate descent: in each parameter block, only the top-$\alpha$ fraction of coordinates, ranked by momentum magnitude, is updated, which keeps the model closer to its pretrained state while still attaining strong fine-tuning performance. The authors prove convergence to a critical point at a rate of $\mathcal{O}(\log T/\sqrt{T})$ in the $L_{1,\text{top-}\alpha}$-filtered gradient sense and provide an illustrative example in which MoFO forgets less than Adam. Empirically, MoFO matches standard fine-tuning performance and significantly mitigates forgetting across a range of tasks (math, code, medical, and common-sense benchmarks) and in continual fine-tuning scenarios, with only modest computational overhead, and it can complement LoRA and other PEFT methods. The approach offers a practical, data-replay-free strategy for preserving pretrained knowledge in real-world fine-tuning, with potential extensions to adaptive update schedules and RLHF pipelines.
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks. Typically, LLMs are first pre-trained on large corpora and subsequently fine-tuned on task-specific datasets. However, during fine-tuning, LLMs may forget some knowledge acquired in the pre-training stage, leading to a decline in general capabilities. Existing approaches to mitigate forgetting often rely on access to pre-training data, which may be unavailable in many real-world scenarios, such as fine-tuning checkpoint-only open-source LLMs. To address this challenge, we propose a new fine-tuning algorithm termed Momentum-Filtered Optimizer (MoFO). MoFO is an extension of greedy block coordinate descent (BCD) methods: in each iteration, MoFO updates only the model parameters with the largest momentum magnitudes, while keeping all other parameters fixed. MoFO achieves similar fine-tuning performance to the default fine-tuning algorithm while effectively mitigating knowledge forgetting. We validate MoFO through rigorous convergence analysis and extensive experiments, demonstrating its effectiveness in mitigating forgetting without pre-training data.
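To make the update rule described above concrete, the following is a minimal NumPy sketch of one momentum-filtered step, not the authors' implementation. It assumes that each named array is one parameter block, that the filter keeps the top-$\alpha$ fraction of entries by first-moment magnitude within each block, and that the kept entries receive a standard bias-corrected Adam update. The names `topk_mask`, `init_state`, and `mofo_step`, and the default hyperparameters, are hypothetical and chosen only for illustration.

```python
import numpy as np


def topk_mask(m, alpha):
    """Boolean mask selecting the ceil(alpha * m.size) entries of m with largest |value|."""
    k = min(m.size, max(1, int(np.ceil(alpha * m.size))))
    flat = np.abs(m).ravel()
    # After argpartition, the last k positions hold the indices of the k largest values.
    idx = np.argpartition(flat, flat.size - k)[flat.size - k:]
    mask = np.zeros(flat.size, dtype=bool)
    mask[idx] = True
    return mask.reshape(m.shape)


def init_state(params):
    """Zero-initialised first and second moments, one pair per block (assumed layout)."""
    return {
        "t": 0,
        "m": {name: np.zeros_like(p) for name, p in params.items()},
        "v": {name: np.zeros_like(p) for name, p in params.items()},
    }


def mofo_step(params, grads, state, lr=1e-3, alpha=0.1,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One sketched momentum-filtered Adam step; updates params in place."""
    state["t"] += 1
    t = state["t"]
    for name, g in grads.items():
        # Moments are tracked for every coordinate, as in Adam.
        m = state["m"][name] = beta1 * state["m"][name] + (1 - beta1) * g
        v = state["v"][name] = beta2 * state["v"][name] + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # Only the coordinates with the largest momentum magnitudes are moved;
        # every other coordinate in the block stays fixed for this iteration.
        mask = topk_mask(m, alpha)
        params[name] -= lr * mask * m_hat / (np.sqrt(v_hat) + eps)
    return params
```

A possible usage pattern under the same assumptions: call `state = init_state(params)` once, then call `mofo_step(params, grads, state, alpha=...)` after each gradient computation. Because the moments are still accumulated for all coordinates, the set of updated coordinates can change from one iteration to the next.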
