HiDe-LLaVA: Hierarchical Decoupling for Continual Instruction Tuning of Multimodal Large Language Model

Haiyang Guo, Fanhu Zeng, Ziwei Xiang, Fei Zhu, Da-Han Wang, Xu-Yao Zhang, Cheng-Lin Liu

TL;DR

HiDe-LLaVA tackles continual instruction tuning for multimodal LLMs through hierarchical decoupling: a task-specific expansion at the top layer and a task-general fusion across the remaining layers. The split is motivated by layer-wise CKA similarity, which shows that the top layer is more task-specific while lower layers capture general knowledge. The method attaches LoRA adapters to all layers, applies a Mixture-of-Experts-like expansion to the top-layer LoRA driven by image and text anchors derived from CLIP, and fuses the LoRAs in the lower layers to preserve shared knowledge. For fair evaluation, the authors identify information leakage in existing benchmarks and propose UCIT, a more challenging benchmark of unseen tasks. They report state-of-the-art performance on UCIT and CoIN, with improved Avg and Last metrics and better efficiency. The work thus contributes both a scalable continual instruction tuning method for MLLMs and a fairer evaluation protocol for benchmarking.

Abstract

Instruction tuning is widely used to improve a pre-trained Multimodal Large Language Model (MLLM) by training it on curated task-specific datasets, enabling better comprehension of human instructions. However, it is infeasible to collect all possible instruction datasets simultaneously in real-world scenarios. Thus, equipping MLLMs with continual instruction tuning is essential for maintaining their adaptability. However, existing methods often trade off memory efficiency for performance gains, significantly compromising overall efficiency. In this paper, we propose a task-specific expansion and task-general fusion framework based on the variations in Centered Kernel Alignment (CKA) similarity across different model layers when trained on diverse datasets. Furthermore, we analyze the information leakage present in the existing benchmark and propose a new and more challenging benchmark to rationally evaluate the performance of different methods. Comprehensive experiments show a significant performance improvement of our method over existing state-of-the-art methods. Code and dataset are released at https://github.com/Ghy0501/HiDe-LLaVA.
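To make the layer-wise similarity analysis concrete, here is a minimal sketch of how a CKA score between two sets of layer outputs can be computed. It assumes the linear variant of CKA and NumPy arrays of shape (n_samples, dim). The summary above does not say which CKA kernel the authors use, so `linear_cka` and its inputs are illustrative names, not the paper's code.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_samples, dim).

    Assumption: the linear kernel is used; other kernels give different scores.
    """
    # Center every feature column so constant offsets do not affect the score.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Hypothetical usage: compare the outputs of one layer on inputs from two tasks.
rng = np.random.default_rng(0)
task_a_outputs = rng.normal(size=(64, 16))  # stand-in for real layer outputs
task_b_outputs = rng.normal(size=(64, 16))
score = linear_cka(task_a_outputs, task_b_outputs)
```

Repeating this per layer for every task pair would yield similarity heatmaps of the kind described for Figure 1, where lower scores at the top layer indicate more task-specific outputs.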

Paper Structure

This paper contains 21 sections, 7 equations, 8 figures, 8 tables, 1 algorithm.

Figures (8)

  • Figure 1: Left: Output CKA similarity heatmaps for different task inputs across the bottom, middle, and top layers. Overall, the output similarity across different tasks decreases markedly at the top layer. Right: Detailed similarity comparison of different task pairs. Even for very different pairs (IconQA and Flickr30k), the similarity differences appear only in the last layers, and most layers remain similar.
  • Figure 2: Impact of different LoRA operational strategies on individual task performance.
  • Figure 3: An overview of the HiDe-LLaVA framework. (a) During training, we optimize the LoRA modules and projector layer with an autoregressive loss, and the image-text anchors are extracted from the image and text encoders of CLIP. (b) At inference time, our method applies a MoE-like expansion on the top-layer LoRA and dynamically distributes expert weights via similarity matching with previously learned image and text anchors. For the remaining layers, general knowledge across tasks is incorporated through LoRA fusion.
  • Figure 4: Ablation studies of dual-modality similarity matching on the UCIT and CoIN benchmarks.
  • Figure 5: Ablation studies of the fusion coefficient on the UCIT and CoIN benchmarks.
  • ...and 3 more figures
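
The inference procedure described in the Figure 3 caption can be sketched roughly as follows. This is a hedged illustration, not the authors' implementation. It assumes cosine similarity between CLIP features and per-task anchors, an equal average over the two modalities, a softmax with a hypothetical `temperature`, and a coefficient-scaled sum as the lower-layer fusion rule. None of these specifics are confirmed by the text above, and all function and parameter names are made up for the example.

```python
import numpy as np


def _normalize(v: np.ndarray) -> np.ndarray:
    """Scale vectors along the last axis to unit length."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def expert_weights(img_feat, txt_feat, img_anchors, txt_anchors, temperature=0.1):
    """Per-task expert weights from anchor matching (assumed rule).

    img_feat, txt_feat: (dim,) features of the current input.
    img_anchors, txt_anchors: (n_tasks, dim) anchors stored per learned task.
    """
    img_sim = _normalize(img_anchors) @ _normalize(img_feat)  # (n_tasks,)
    txt_sim = _normalize(txt_anchors) @ _normalize(txt_feat)  # (n_tasks,)
    # Assumption: both modalities count equally before the softmax.
    logits = (img_sim + txt_sim) / (2.0 * temperature)
    logits = logits - logits.max()  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


def top_layer_delta(weights, lora_a, lora_b):
    """MoE-like mix of per-task top-layer LoRA updates B_t @ A_t.

    lora_a: (n_tasks, rank, d_in), lora_b: (n_tasks, d_out, rank).
    """
    deltas = lora_b @ lora_a  # batched matmul -> (n_tasks, d_out, d_in)
    return np.tensordot(weights, deltas, axes=1)  # (d_out, d_in)


def fuse_lower_layer(lora_a, lora_b, coeff):
    """Task-general fusion for a lower layer.

    Assumption: a coefficient-scaled sum of per-task updates; the paper's
    actual fusion formula may differ.
    """
    return coeff * (lora_b @ lora_a).sum(axis=0)
```

In this sketch the weight update added to the frozen top-layer projection depends on the input through `expert_weights`, whereas the lower-layer update from `fuse_lower_layer` is the same for every input, which mirrors the task-specific versus task-general split described in the TL;DR.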