MoFO: Momentum-Filtered Optimizer for Mitigating Forgetting in LLM Fine-Tuning

Yupeng Chen; Senmiao Wang; Yushun Zhang; Zhihang Lin; Haozhe Zhang; Weijian Sun; Tian Ding; Ruoyu Sun

TL;DR

This work tackles forgetting during large language model fine-tuning when pretraining data may be unavailable. It introduces MoFO, a momentum-filtered optimizer that integrates Adam with greedy block coordinate descent by updating only the top-$\alpha$ momentum coordinates in each parameter block, thereby keeping the model closer to its pretrained state while still attaining strong fine-tuning performance. The authors prove convergence to a critical point with a rate $\mathcal{O}(\log T/\sqrt{T})$ in the $L_{1,\text{top-}\alpha}$-filtered gradient sense and provide an illustrative example showing reduced forgetting in MoFO compared to Adam. Empirically, MoFO matches standard fine-tuning performance and significantly mitigates forgetting across a range of tasks (math, code, medical, and common-sense benchmarks) and continual fine-tuning scenarios, with only modest computational overhead, and it can complement LoRA and other PEFT methods. This approach offers a practical, data-replay-free strategy to preserve pretrained knowledge in real-world fine-tuning, with potential extensions to adaptive update schedules and RLHF pipelines.

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks. Typically, LLMs are first pre-trained on large corpora and subsequently fine-tuned on task-specific datasets. However, during fine-tuning, LLMs may forget some knowledge acquired in the pre-training stage, leading to a decline in general capabilities. Existing approaches to mitigate forgetting often rely on access to pre-training data, which may be unavailable in many real-world scenarios--such as fine-tuning checkpoint-only open-source LLMs. To address this challenge, we propose a new fine-tuning algorithm termed Momentum-Filtered Optimizer (MoFO). MoFO is an extension of greedy block coordinate descent (BCD) methods: in each iteration, MoFO only updates the model parameters with the largest momentum magnitudes, while keeping all other parameters fixed. MoFO achieves similar fine-tuning performance to the default fine-tuning algorithm while effectively mitigating knowledge forgetting. We validate MoFO through rigorous convergence analysis and extensive experiments, demonstrating its effectiveness in mitigating forgetting without pre-training data.
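The update rule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes each parameter block is a flat list of floats, uses standard Adam bias correction with an `eps` term, and picks the `ceil(alpha * n)` entries of largest momentum magnitude per block. The helper names `top_alpha_mask`, `init_state`, and `mofo_step`, as well as the default hyperparameter values, are hypothetical.

```python
import math


def top_alpha_mask(values, alpha):
    """Boolean mask marking the ceil(alpha * n) entries of largest magnitude."""
    n = len(values)
    k = max(1, math.ceil(alpha * n))  # assumption: keep at least one entry
    ranked = sorted(range(n), key=lambda i: abs(values[i]), reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(n)]


def init_state(blocks):
    """Zero-initialised first and second moments, one list per block."""
    return {
        "t": 0,
        "m": {name: [0.0] * len(p) for name, p in blocks.items()},
        "v": {name: [0.0] * len(p) for name, p in blocks.items()},
    }


def mofo_step(blocks, grads, state, lr=1e-3, alpha=0.25,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One momentum-filtered Adam step, applied in place to `blocks`."""
    state["t"] += 1
    t = state["t"]
    for name, theta in blocks.items():
        g, m, v = grads[name], state["m"][name], state["v"][name]
        # Moments are updated for every coordinate, as in Adam.
        for i in range(len(theta)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
        # Only coordinates with the largest momentum magnitude in this block move.
        mask = top_alpha_mask(m, alpha)
        for i in range(len(theta)):
            if mask[i]:
                m_hat = m[i] / (1 - beta1 ** t)
                v_hat = v[i] / (1 - beta2 ** t)
                theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)


# Toy example: two blocks, one step, 25% of each block updated.
blocks = {"w1": [0.0, 0.0, 0.0, 0.0], "w2": [1.0, 1.0]}
grads = {"w1": [4.0, -1.0, 0.5, 2.0], "w2": [0.1, -0.3]}
state = init_state(blocks)
mofo_step(blocks, grads, state, lr=1e-3, alpha=0.25)
print(blocks)
```

In this toy run only one coordinate per block changes, while the moment estimates are still tracked for all coordinates. Filtering per block rather than globally means every block has some entries eligible for updating at each step.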

Paper Structure (54 sections, 9 theorems, 101 equations, 15 figures, 17 tables, 3 algorithms)

Introduction
Momentum Filtered Optimizer (MoFO)
Motivation
Formulation of MoFO
Theoretical Analysis
Convergence Result
Choice of norm and two subgoals.
Initial Analysis on Forgetting Mitigation
Experiments
Experimental Settings
Instruction Fine-Tuning
Continual Fine-Tuning
Impact of Update Strategy in MoFO
Further Analysis
Related Works
...and 39 more sections

Key Result

Theorem 1

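Suppose that the first- and second-order momentum hyperparameters $\beta_1$ and $\beta_2$ satisfy $0 < \beta_1 < \sqrt{\beta_2} < 1$, and that the learning rate at step $t$ is $\eta_t = \eta / \sqrt{t}$ for some $\eta > 0$. Then, under Assumption \ref{asmp:mofo-hyperparam}, MoFO converges to a critical point at a rate of $\mathcal{O}(\log T/\sqrt{T})$, measured in the $L_{1,\text{top-}\alpha}$ norm of the gradient. Moreover, this bound directly implies a convergence bound in the $L_p$ norm of the gradient for any $p \in [1, \infty]$.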
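To make the quantities in Theorem 1 concrete, the sketch below computes the decaying step size $\eta_t = \eta/\sqrt{t}$ and one plausible reading of the $L_{1,\text{top-}\alpha}$ filtered norm named in the TL;DR. The definition used here is an assumption, not taken from this page: for each block, sum the `ceil(alpha * n)` largest absolute gradient entries, then add the per-block sums. The function names are hypothetical.

```python
import math


def lr_schedule(eta, t):
    """Step size eta_t = eta / sqrt(t) for t >= 1."""
    return eta / math.sqrt(t)


def top_alpha_l1(block, alpha):
    """Sum of the ceil(alpha * n) largest absolute values in one block (assumed definition)."""
    k = max(1, math.ceil(alpha * len(block)))
    magnitudes = sorted((abs(x) for x in block), reverse=True)
    return sum(magnitudes[:k])


def filtered_grad_norm(grad_blocks, alpha):
    """Assumed L_{1,top-alpha} norm: per-block top-alpha L1 sums, added over blocks."""
    return sum(top_alpha_l1(block, alpha) for block in grad_blocks.values())


grad_blocks = {"a": [3.0, -4.0, 1.0, 0.0], "b": [0.5, -0.25]}
filtered = filtered_grad_norm(grad_blocks, alpha=0.5)
l_inf = max(abs(x) for block in grad_blocks.values() for x in block)
l_one = sum(abs(x) for block in grad_blocks.values() for x in block)
print(filtered, l_inf, l_one, lr_schedule(0.1, 4))
```

Under this assumed definition the filtered norm always lies between the $L_\infty$ and $L_1$ norms of the full gradient, because every block contributes at least its largest entry and at most all of its entries. That sandwich is one way to see why a bound on the filtered norm carries over to other $L_p$ norms, up to dimension-dependent constants.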

Figures (15)

Figure 1: The loss landscapes of Pythia-160M after fine-tuning on a subset of the FLAN dataset using Adam and Lion. We plot the loss landscapes on (a) the fine-tuning dataset and (b) the pre-training dataset (the Pile dataset; gao2020pile). We visualize a 2D weight-space plane spanned by the vector from the pre-trained model to the Lion-tuned model (x-axis) and the vector from the pre-trained model to the Adam-tuned model (y-axis). Axes are normalized so that one unit equals the length of the pre-trained$\to$Adam vector. The color bar indicates the loss value: (a) fine-tuning loss and (b) pre-training loss. A logarithmic scale is applied to the loss values for better visualization. The two training methods converge to different minima with similar fine-tuning loss. Lion converges to a minimum farther from the pre-trained model and exhibits more forgetting than Adam.
Figure 2: (a) Loss changes on the RedPajama dataset and (b) average accuracy changes on the MMLU benchmark (measuring the preservation of factual knowledge) of Llama-2-7B after fine-tuning on MetaMathQA using Adam, Lion, and MoFO for 0.5, 1, 1.5, and 2 epochs. The RedPajama project was explicitly designed as an open-source reproduction of the LLaMA training dataset (together2023redpajama). It therefore serves as a reasonable proxy for the original LLaMA-2 training dataset, which has not been publicly released. See Appendix \ref{subapp:redpajama} for the rationale. The results show a strong positive correlation between the distance from the pre-trained model and the extent of forgetting after one epoch. Further discussion of early-training behavior and a comparison of different optimizers are provided in Appendix \ref{app:supp-correlation—additional}.
Figure 3: Illustration of MoFO.
Figure 4: The fine-tuning loss landscape and the training paths of different optimization methods. The color bar indicates the fine-tuning loss value.
Figure 5: The performance on the math task (GSM8K) and the scores in general capabilities of Llama-2-7B after fine-tuning on the MetaMathQA dataset. Only points on the Pareto front are shown as solid points, while the remaining points are presented as semi-transparent. The results show that compared with $L_1$, $L_2$ regularization, and LoRA across various hyperparameter configurations, the MoFO algorithm achieves a better Pareto front.
...and 10 more figures

Theorems & Definitions (24)

Theorem 1: Convergence of MoFO
Proof sketch of Theorem \ref{thm:mofo-cvrg}
Example 1
Remark 1
Theorem 2
Definition 1
Remark 2
Definition 2
Proposition 1
proof
...and 14 more
