How to Teach Large Multimodal Models New Skills
Zhen Zhu, Yiming Gong, Yao Xiao, Yaoyao Liu, Derek Hoiem
TL;DR
The paper studies how to teach large multimodal models (LMMs) new skills without erasing prior abilities, evaluating sequential fine-tuning on five target tasks and eight held-out benchmarks across three backbones. It finds that apparent forgetting after narrow fine-tuning can partly recover at later stages, and traces this behavior to a shift in the next-token output distribution that a simple counting-bias probe tracks. By analyzing transformer components, the authors identify two robust tuning recipes that limit this drift: updating only the self-attention projection layers (SA Proj), or updating only the MLP Gate&Up projections while freezing the Down projection. Across backbones and tasks, both recipes yield strong target gains while largely preserving held-out performance, which makes them practical choices for continual learning in LMMs.
Abstract
How can we teach large multimodal models (LMMs) new skills without erasing prior abilities? We study sequential fine-tuning on five target skills while monitoring general ability on eight held-out benchmarks across three model families. We observe that apparent "forgetting" on held-out tasks after narrow fine-tuning can partly recover at later stages. We trace this behavior to a measurable shift in the output token distribution, manifested through a simple counting-bias probe that co-varies with forgetting. Guided by this picture, we identify two simple, robust tuning recipes that learn strongly while limiting drift: (i) updating only the self-attention projection layers, and (ii) updating only the MLP Gate&Up while freezing the Down projection. Across models and tasks, these choices deliver strong target gains while largely preserving held-out performance. Code is available at https://github.com/jessemelpolio/LMM_CL
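The two recipes amount to choosing which parameters receive gradients. The following is a minimal sketch of that selection, not the authors' implementation. It assumes LLaMA-style submodule names (`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`), assumes that "SA Proj" covers the query, key, value, and output projections, and uses a hypothetical `scope` prefix to restrict tuning to the language model. The helper names `is_trainable` and `apply_recipe` are illustrative only.

```python
# Assumed LLaMA-style submodule names; real checkpoints may use different ones.
RECIPES = {
    # Recipe (i): update only the self-attention projection layers.
    "sa_proj": ("q_proj", "k_proj", "v_proj", "o_proj"),
    # Recipe (ii): update only MLP Gate and Up; down_proj is absent, so it stays frozen.
    "mlp_gate_up": ("gate_proj", "up_proj"),
}


def is_trainable(param_name, recipe, scope=""):
    """Return True if the named parameter should be updated under `recipe`.

    `scope` is a hypothetical name prefix (e.g. "language_model.") used to
    leave other components, such as a vision encoder, frozen.
    """
    if not param_name.startswith(scope):
        return False
    parts = param_name.split(".")
    return any(part in RECIPES[recipe] for part in parts)


def apply_recipe(named_params, recipe, scope=""):
    """Set `requires_grad` on each parameter and return the trainable names.

    `named_params` is an iterable of (name, param) pairs, for example the
    result of `model.named_parameters()` on a PyTorch module.
    """
    trainable = []
    for name, param in named_params:
        param.requires_grad = is_trainable(name, recipe, scope)
        if param.requires_grad:
            trainable.append(name)
    return trainable
```

With a PyTorch model, a call such as `apply_recipe(model.named_parameters(), "mlp_gate_up", scope="language_model.")` would freeze everything except the Gate and Up projections inside the assumed language-model prefix. Checking the returned name list against the actual checkpoint is worthwhile, since module naming varies between model families.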
