Locate-then-Merge: Neuron-Level Parameter Fusion for Mitigating Catastrophic Forgetting in Multimodal LLMs

Zeping Yu, Sophia Ananiadou

TL;DR

The paper tackles the catastrophic forgetting that arises in multimodal LLMs after vision-language instruction tuning. It introduces Locate-then-Merge, a training-free parameter fusion framework, and a neuron-level instantiation called Neuron-Fusion, which preserves neurons with large parameter changes and suppresses neurons with small changes, so that the model retains its language ability while keeping its visual adaptation. Across 13 benchmarks and two open-source MLLMs, Neuron-Fusion consistently outperforms existing model merging methods, and generation analysis shows reduced Not-Known and Context-Hallucination, improving reliability. The approach provides a practical, data-free means to maintain language proficiency while enabling robust visual capabilities in multimodal models, with potential applicability to broader modalities and architectures in future work.

Abstract

Although multimodal large language models (MLLMs) have achieved impressive performance, the multimodal instruction tuning stage often causes catastrophic forgetting of the base LLM's language ability, even in strong models like Llama3. To address this, we propose Locate-then-Merge, a training-free parameter fusion framework that first locates important parameters and then selectively merges them. We further introduce Neuron-Fusion, a neuron-level strategy that preserves the influence of neurons with large parameter shifts--neurons likely responsible for newly acquired visual capabilities--while attenuating the influence of neurons with smaller changes that likely encode general-purpose language skills. This design enables better retention of visual adaptation while mitigating language degradation. Experiments on 13 benchmarks across both language and visual tasks show that Neuron-Fusion consistently outperforms existing model merging methods. Further analysis reveals that our method effectively reduces context hallucination in generation.
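
The locate-then-merge idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `neuron_fusion_sketch`, the parameters `keep_ratio` and `small_scale`, the choice of one matrix row per neuron, and the use of the L2 norm of the weight difference as the change measure are all assumptions made for illustration only.

```python
import numpy as np


def neuron_fusion_sketch(w_base, w_tuned, keep_ratio=0.1, small_scale=0.0):
    """Fuse one weight matrix of a base LLM with its visually tuned version.

    Assumptions (hypothetical, not taken from the paper):
    - each row of the matrix is one neuron;
    - a neuron's change is the L2 norm of its row in (w_tuned - w_base);
    - the top `keep_ratio` fraction of neurons keep their full update,
      all other neurons have their update scaled by `small_scale`.
    """
    delta = w_tuned - w_base
    # Locate: per-neuron change magnitude.
    change = np.linalg.norm(delta, axis=1)
    # Number of neurons whose update is fully preserved (at least one).
    k = max(1, int(round(keep_ratio * change.shape[0])))
    top = np.argsort(change)[-k:]
    # Merge: coefficient 1 for large-change neurons, small_scale otherwise.
    coeff = np.full(change.shape[0], small_scale, dtype=delta.dtype)
    coeff[top] = 1.0
    merged = w_base + coeff[:, None] * delta
    return merged, coeff
```

With `small_scale=0.0`, small-change neurons revert to the base LLM weights; a value between 0 and 1 would attenuate their update rather than discard it. How the selected neurons are actually chosen and rescaled in Neuron-Fusion is specified in the paper itself, not in this sketch.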

Paper Structure

This paper contains 30 sections, 10 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Neuron-Fusion in MLLMs. After visual tuning, some neurons exhibit larger changes than others. Neuron-Fusion selectively preserves neurons with significant parameter changes while suppressing those with smaller changes. This targeted fusion enables the model to retain newly acquired visual capabilities while minimally affecting its general language abilities.
  • Figure 2: The structures of LLM and MLLM.
  • Figure 3: Change of neurons in FFN up matrix.
  • Figure 4: Change of neurons in attention query matrix.
  • Figure 5: Change of coefficients after Neuron-Fusion.
  • ...and 7 more figures