Multimodal Continual Instruction Tuning with Dynamic Gradient Guidance

Songze Li; Mingyu Gao; Tonghua Su; Xu-Yao Zhang; Zhongjie Wang

TL;DR

This paper reframes catastrophic forgetting in multimodal continual instruction tuning (MCIT) as a missing-gradient problem and proposes a dynamic gradient guidance strategy: the directional vector from the current parameters to the previously optimal parameters serves as an approximation of the old-task gradients. The method combines this gradient guidance with limited replay data and a Bernoulli-based dynamic update rule to balance stability and plasticity, and it achieves state-of-the-art performance on two MCIT benchmarks without model expansion. Key contributions include a formal gradient-approximation framework, a scalable update mechanism, and comprehensive ablations across datasets with varying distribution shifts. The results suggest a practical path toward robust MCIT with compact architectures, though limitations remain when distribution shifts are large and when storing replay data becomes a constraint.

Abstract

Multimodal continual instruction tuning enables multimodal large language models to sequentially adapt to new tasks while building upon previously acquired knowledge. However, this continual learning paradigm faces the significant challenge of catastrophic forgetting, where learning new tasks leads to performance degradation on previous ones. In this paper, we introduce a novel insight into catastrophic forgetting by conceptualizing it as a problem of missing gradients from old tasks during new task learning. Our approach approximates these missing gradients by leveraging the geometric properties of the parameter space, specifically using the directional vector between current parameters and previously optimal parameters as gradient guidance. This approximated gradient can be further integrated with real gradients from a limited replay buffer and regulated by a Bernoulli sampling strategy that dynamically balances model stability and plasticity. Extensive experiments on multimodal continual instruction tuning datasets demonstrate that our method achieves state-of-the-art performance without model expansion, effectively mitigating catastrophic forgetting while maintaining a compact architecture.
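
The abstract does not give the update rule, so the following is only a minimal sketch of how the three ingredients it names could fit together: the directional vector toward the previously optimal parameters, a real gradient from a small replay buffer, and a Bernoulli gate. The function name `guided_step`, the weighting `alpha`, the gate probability `p`, and the sign convention for the approximated gradient are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def guided_step(theta, theta_prev, grad_new, grad_replay, lr, p, rng, alpha=1.0):
    """One hypothetical update step (illustrative, not the authors' exact rule).

    theta       : current parameters
    theta_prev  : parameters that were optimal after the previous task
    grad_new    : gradient of the new-task loss at theta
    grad_replay : gradient of the loss on a small replay batch at theta
    p           : assumed probability of applying old-task guidance
    """
    # Directional vector from the current parameters to the old optimum.
    direction = theta_prev - theta
    # Assumed sign: treat -direction as the stand-in for the missing
    # old-task gradient, so that a descent step moves along `direction`.
    approx_old_grad = -direction

    update = np.array(grad_new, dtype=float)
    # Bernoulli gate: with probability p, add the old-task terms (stability);
    # otherwise follow the new-task gradient alone (plasticity).
    if rng.random() < p:
        update = update + alpha * approx_old_grad + grad_replay
    return theta - lr * update

# Usage with toy values: with the gate always on and no other gradients,
# the step pulls the parameters toward the previous optimum.
rng = np.random.default_rng(0)
theta = np.array([1.0, -2.0])
theta_prev = np.zeros(2)
zeros = np.zeros(2)
stepped = guided_step(theta, theta_prev, zeros, zeros, lr=0.1, p=1.0, rng=rng)
```

Under these assumptions, `p` close to 1 favors retention of old tasks, while `p` close to 0 reduces the rule to ordinary gradient descent on the new task.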
