How to Teach Large Multimodal Models New Skills

Zhen Zhu, Yiming Gong, Yao Xiao, Yaoyao Liu, Derek Hoiem

TL;DR

The paper investigates how to teach large multimodal models (LMMs) new skills without erasing prior abilities, evaluating sequential fine-tuning on five target tasks and eight held-out benchmarks across three backbones. It finds that apparent forgetting after narrow fine-tuning can partly recover at later stages and is linked to a shift in the next-token output distribution, which a simple counting-bias probe can measure. By analyzing transformer components, the authors identify two robust tuning recipes that limit this drift: updating only the self-attention projection layers (SA Proj.), or updating only the MLP Gate&Up while freezing the Down projection. Across backbones and tasks, both recipes yield strong target gains while largely preserving held-out performance, offering practical guidance for stable continual learning in LMMs.

Abstract

How can we teach large multimodal models (LMMs) new skills without erasing prior abilities? We study sequential fine-tuning on five target skills while monitoring general ability on eight held-out benchmarks across three model families. We observe that apparent "forgetting" on held-out tasks after narrow fine-tuning can partly recover at later stages. We trace this behavior to a measurable shift in the output token distribution, manifested through a simple counting-bias probe that co-varies with forgetting. Guided by this picture, we identify two simple, robust tuning recipes that learn strongly while limiting drift: (i) updating only the self-attention projection layers, and (ii) updating only the MLP Gate&Up while freezing the Down projection. Across models and tasks, these choices deliver strong target gains while largely preserving held-out performance. Code is available at https://github.com/jessemelpolio/LMM_CL
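
The two recipes named in the abstract amount to choosing which language-model parameters receive gradients. The following is a minimal sketch of that selection, not the authors' implementation. It assumes Hugging Face LLaMA/Qwen2-style parameter names (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj), assumes that "self-attention projection layers" covers the query, key, value, and output projections, and assumes language-model parameters carry a "language_model" marker in their names. The recipe keys and function names are hypothetical.

```python
import re

# Hypothetical recipe keys. The patterns assume Hugging Face LLaMA/Qwen2-style
# parameter names; other backbones may name these modules differently.
RECIPES = {
    # Assumption: "SA Proj." means the Q, K, V, and output projections.
    "sa_proj": re.compile(r"\.self_attn\.(q_proj|k_proj|v_proj|o_proj)\."),
    # MLP Gate&Up: the Down projection is deliberately not matched.
    "mlp_gate_up": re.compile(r"\.mlp\.(gate_proj|up_proj)\."),
}


def trainable_names(param_names, recipe, lm_marker="language_model"):
    """Return the names a recipe leaves trainable, restricted to the language model.

    lm_marker is an assumed substring that identifies language-model parameters,
    so that same-named projections in the vision encoder stay frozen.
    """
    pattern = RECIPES[recipe]
    return [n for n in param_names if lm_marker in n and pattern.search(n)]


def apply_recipe(model, recipe, lm_marker="language_model"):
    """Freeze every parameter except those selected by the recipe.

    `model` is expected to behave like a torch.nn.Module, i.e. to expose
    named_parameters() yielding (name, parameter) pairs.
    """
    names = [name for name, _ in model.named_parameters()]
    keep = set(trainable_names(names, recipe, lm_marker))
    for name, param in model.named_parameters():
        param.requires_grad = name in keep
    return keep
```

With a real model, one would call apply_recipe(model, "sa_proj") before building the optimizer and pass only the parameters with requires_grad set to True. Matching on names rather than module types keeps the sketch independent of any particular model class, at the cost of relying on the naming convention assumed above.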

Paper Structure

This paper contains 60 sections, 18 equations, 10 figures, 29 tables.

Figures (10)

  • Figure 1: Surprising Forgetting Behavior in LMMs: Left: When fine-tuning most components on one target task, we see major improvement in that task ("Learning") but a substantial drop in performance on other tasks ("Forgetting", total across tasks shown here), as expected. But if we only tune the self-attention projection layers (SA Proj.) in the language model, we still get substantial learning on the target task with minimal forgetting. Right: Even when fine-tuning SA Proj. on multiple tasks sequentially, we see no forgetting. For the other settings, we see large forgetting on the PixmoCount task, but the models partly recover what they "forgot" when learning the next specialized task. Our paper documents and analyzes these and other interesting phenomena of learning and forgetting in LMMs, leading to simple and effective ways to teach LMMs new skills.
  • Figure 2: Architecture of our evaluated LMMs. The input contains visual inputs such as images or videos, which are converted to visual tokens by the vision encoder, and text input is processed by a tokenizer containing a visual placeholder token <image>. Visual tokens are converted by the projector and concatenated with text tokens as input for the language model. We visualize the architecture of the transformer decoder layer of the language model. "LN", "MHA", "MLP" represent layer norm, multi-head attention, and multi-layer perceptron, respectively. $r^{(l)}$ is the final output of layer $l$.
  • Figure 3: Learning–forgetting tracks output-distribution shift. On LLaVA-OneVision tuned for counting, we plot five curves over log-spaced steps for LLM, SA Proj., MLP, MLP (Gate&Up) and MLP (LwF). The dashed line represents the base model. Left: PixmoCount accuracy rises for all methods. Middle: mean held-out performance drops sharply for LLM and MLP, remains nearly unchanged for SA Proj., and is preserved by MLP (LwF). Right: the expected likelihood of number tokens on non-counting captions (LCS-558K; Liu et al., 2023) surges for LLM and MLP, stays near baseline for SA Proj., and changes little for MLP (LwF).
  • Figure 4: Visualizations of counting and captioning examples after tuning MLP and SA Proj. on the counting task. The counting and captioning examples are sampled from the PixmoCount dataset and the LCS-558K (Liu et al., 2023) dataset, respectively.
  • Figure 5: Comparison of different continual learning techniques in the default sequential task curriculum. For LwF and WiSE-FT, only the MLP layers are tuned. LoRA adapters are applied only to the MLP layers. MoE is also applied only to the MLP layers.
  • ...and 5 more figures
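
The right panel of Figure 3 tracks the expected likelihood of number tokens on non-counting captions. One way to read that quantity is as the average probability mass the model places on number tokens at each caption position. The sketch below computes that average from precomputed next-token distributions. It is an interpretation under stated assumptions rather than the paper's exact probe: the function name, the dictionary-based input format, and the choice of which token ids count as "number tokens" are all hypothetical.

```python
def number_token_mass(step_probs, number_token_ids):
    """Mean probability assigned to number tokens across caption positions.

    step_probs: one dict per position, mapping token id -> next-token probability
        (tokens missing from a dict are treated as having probability 0).
    number_token_ids: ids of the tokens treated as numbers; which tokens belong
        in this set is an assumption of this sketch.
    """
    if not step_probs:
        raise ValueError("need at least one caption position")
    ids = set(number_token_ids)
    # Probability mass on number tokens at each position.
    per_step = [
        sum(p for tok, p in probs.items() if tok in ids) for probs in step_probs
    ]
    # Average over positions.
    return sum(per_step) / len(per_step)
```

Comparing this value for a base model and a fine-tuned model on the same non-counting captions would show whether tuning on counting has pushed the output distribution toward number tokens.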