LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation

Fangxun Shu; Yue Liao; Le Zhuo; Chenning Xu; Lei Zhang; Guanghao Zhang; Haonan Shi; Long Chen; Tao Zhong; Wanggui He; Siming Fu; Haoyuan Li; Bolin Li; Zhelun Yu; Si Liu; Hongsheng Li; Hao Jiang

Abstract

We introduce LLaVA-MoD, a novel framework that enables efficient training of small-scale Multimodal Language Models (s-MLLM) by distilling knowledge from a large-scale MLLM (l-MLLM). Our approach tackles two fundamental challenges in MLLM distillation. First, we optimize the network structure of the s-MLLM by integrating a sparse Mixture of Experts (MoE) architecture into the language model, striking a balance between computational efficiency and model expressiveness. Second, we propose a progressive knowledge transfer strategy to ensure comprehensive knowledge migration. This strategy begins with mimic distillation, where we minimize the Kullback-Leibler (KL) divergence between the output distributions of the two models so that the student emulates the teacher's understanding. We then apply preference distillation via Direct Preference Optimization (DPO), where the key is to treat the l-MLLM as the reference model. During this phase, the s-MLLM's ability to discriminate between superior and inferior examples improves beyond that of the l-MLLM, yielding a student that surpasses its teacher, particularly on hallucination benchmarks. Extensive experiments demonstrate that LLaVA-MoD outperforms existing models across various multimodal benchmarks while keeping the number of activated parameters and the computational cost low. Remarkably, LLaVA-MoD, with only 2B activated parameters, surpasses Qwen-VL-Chat-7B by an average of 8.8% across benchmarks, using merely 0.3% of the training data and 23% of the trainable parameters. These results underscore LLaVA-MoD's ability to distill comprehensive knowledge from its teacher model, paving the way for more efficient MLLMs. The code will be available at https://github.com/shufangxun/LLaVA-MoD.
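The two distillation stages named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the KL direction KL(p_T || p_S), the per-token averaging, and the value of `beta` are all assumptions made for illustration. The preference loss follows the standard DPO form, with the teacher's sequence log-probabilities standing in for the reference model.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def mimic_kl_loss(teacher_logits, student_logits):
    """Mean over token positions of KL(p_T || p_S).

    Both arguments are lists of per-token logit rows over the same vocabulary.
    The KL direction is an assumption; the abstract does not specify it.
    """
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        log_p_t = log_softmax(t_row)
        log_p_s = log_softmax(s_row)
        total += sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_t, log_p_s))
    return total / len(teacher_logits)


def preference_loss(s_pos, s_neg, t_pos, t_neg, beta=0.1):
    """DPO-style loss with the teacher as the reference model.

    s_pos / s_neg: student log-probabilities of the preferred / rejected response.
    t_pos / t_neg: teacher (reference) log-probabilities of the same responses.
    beta is a hypothetical temperature; 0.1 is a common DPO default.
    """
    margin = beta * ((s_pos - t_pos) - (s_neg - t_neg))
    # -log(sigmoid(margin)) written as a stable softplus of -margin.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Usage: the preference loss falls below log(2) once the student rates the
# preferred response higher, and the rejected one lower, than the teacher does.
kl = mimic_kl_loss([[1.0, 2.0, 3.0]], [[0.5, 2.0, 3.5]])
po = preference_loss(s_pos=-1.0, s_neg=-5.0, t_pos=-2.0, t_neg=-3.0)
```

In this form the preference loss equals log(2) when the student matches the teacher exactly, so any value below that indicates the student separates preferred from rejected responses more sharply than its reference.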

Paper Structure (21 sections, 8 equations, 3 figures, 16 tables)

Introduction
Related Work
Method
Architecture Design of Sparse s-MLLM
Progressive Distillation
Experiments
Experimental Settings
Main Results
Ablation Study
Impact of Preference Distillation
Impact of Training Strategy
Impact of Model Architecture
Sparse Architecture Facilitates Knowledge Transfer
Conclusion
Implementation Details
...and 6 more sections

Figures (3)

Figure 1: Comparisons of training cost and performance. LLaVA-MoD achieves comparable performance with advanced MLLMs using significantly lower training costs while outperforming current small-scale MLLMs by a large margin.
Figure 2: Progressive Distillation of LLaVA-MoD. (a) Mimic Distillation: Aligning the student's response probabilities $p_{S}$ with those of the teacher $p_{T}$, via the Kullback–Leibler (KL) loss. (b) Preference Distillation: Increasing the student's positive response probabilities $p^{+}_{S}$ to surpass those of the teacher $p^{+}_{T}$, while decreasing the student's negative response probabilities $p^{-}_{S}$ to fall below those of the teacher $p^{-}_{T}$, via the Preference-Optimization (PO) loss.
Figure 3: Sparsification of s-MLLM. The VL Adaptor and Vision Encoder remain unchanged, while the LLM is upcycled from dense to sparse.
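The dense-to-sparse upcycling in Figure 3 can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's code: it assumes that upcycling copies the dense feed-forward weights into every expert and that a linear router selects the top-k experts per token with renormalized gates. The names `upcycle`, `moe_forward`, and `top_k` are hypothetical, and the "FFN" is reduced to a single linear map to keep the example short.

```python
import copy
import math


def linear(weights, x):
    """Apply a linear map; weights is a list of rows, x is a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def upcycle(dense_weights, num_experts):
    """Create experts as independent copies of the dense FFN weights (assumed scheme)."""
    return [copy.deepcopy(dense_weights) for _ in range(num_experts)]


def moe_forward(experts, router_weights, x, top_k=2):
    """Route x to the top_k experts and mix their outputs with renormalized gates."""
    probs = softmax(linear(router_weights, x))  # one routing probability per expert
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    out = [0.0] * len(experts[0])
    for i in top:
        gate = probs[i] / norm
        y = linear(experts[i], x)
        out = [o + gate * v for o, v in zip(out, y)]
    return out, top


# Usage: immediately after upcycling, every expert equals the dense FFN,
# so the sparse layer reproduces the dense output while running only top_k experts.
dense = [[1.0, 2.0], [0.5, -1.0]]
router = [[0.3, -0.1], [0.0, 0.2], [-0.4, 0.1], [0.2, 0.2]]  # hypothetical router weights
x = [2.0, 3.0]
experts = upcycle(dense, num_experts=4)
sparse_out, chosen = moe_forward(experts, router, x, top_k=2)
```

Under these assumptions the sparse layer starts from the same function as the dense one, and the experts only diverge once training updates them separately, which is one reason upcycling is a convenient initialization.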
