Efficient Multi-modal Large Language Models via Progressive Consistency Distillation

Zichen Wen, Shaobo Wang, Yufa Zhou, Junyuan Zhang, Qintong Zhang, Yifeng Gao, Zhaorun Chen, Bin Wang, Weijia Li, Conghui He, Linfeng Zhang

TL;DR

Multi-modal large language models (MLLMs) incur high computational cost because of their dense visual tokens. The authors introduce EPIC, a progressive consistency distillation framework that trains a single MLLM as both teacher and student with shared weights, using Token Consistency Distillation (TCD) and Layer Consistency Distillation (LCD) to adapt progressively to token compression. By jointly scheduling token-level and layer-wise compression with a gradually increasing teacher-student gap, EPIC achieves competitive accuracy with substantially fewer visual tokens, along with improved efficiency, robustness, and generalization across token compression strategies. The method requires no architectural changes and offers practical benefits for resource-constrained deployment, though the compression level must be balanced to optimize the latency-accuracy trade-off. Overall, EPIC advances training-aware token compression by enabling smooth, progressive adaptation along both the token and layer dimensions, yielding robust, generalizable MLLMs under varying inference budgets.

Abstract

Visual tokens consume substantial computational resources in multi-modal large language models (MLLMs), significantly compromising their efficiency. Recent works have attempted to improve efficiency by compressing visual tokens during training, either through modifications to model components or by introducing additional parameters. However, they often overlook the increased learning difficulty caused by such compression, as the model's parameter space struggles to quickly adapt to the substantial perturbations in the feature space induced by token compression. In this work, we propose to develop Efficient MLLMs via Progressive Consistency Distillation (EPIC), a progressive learning framework. Specifically, by decomposing the feature space perturbations introduced by token compression along the token-wise and layer-wise dimensions, we introduce token consistency distillation and layer consistency distillation, respectively, aiming to reduce the training difficulty by leveraging guidance from a teacher model and following a progressive learning trajectory. Extensive experiments demonstrate the superior effectiveness, robustness, and generalization capabilities of our proposed framework.
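The progressive trajectory described in the abstract can be pictured as a schedule over training steps. The following is a minimal sketch, not the authors' implementation: the linear ramp, the function name `plan_at`, and the defaults `max_ratio`, `max_gap`, `num_layers`, and `min_layer` are all illustrative assumptions. It only shows the general shape of the idea: the student's token compression ratio grows over time, the teacher stays slightly less compressed with a widening gap, and the layer at which compression is applied moves from deep to shallow.

```python
from dataclasses import dataclass


@dataclass
class CompressionPlan:
    student_ratio: float  # fraction of visual tokens the student drops
    teacher_ratio: float  # fraction the teacher drops (never more than the student)
    layer: int            # layer index at which tokens are compressed


def plan_at(step, total_steps, max_ratio=0.889, max_gap=0.2,
            num_layers=32, min_layer=2):
    """Hypothetical linear schedule; every default here is an assumption."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Training progress clamped to [0, 1].
    progress = min(max(step / total_steps, 0.0), 1.0)
    # Token dimension: the student's compression ratio ramps up from 0.
    student_ratio = progress * max_ratio
    # The teacher-student gap also widens as training proceeds.
    teacher_ratio = max(student_ratio - progress * max_gap, 0.0)
    # Layer dimension: compression starts at the deepest layer and moves shallower.
    deepest = num_layers - 1
    layer = round(deepest - progress * (deepest - min_layer))
    return CompressionPlan(student_ratio, teacher_ratio, layer)


if __name__ == "__main__":
    for step in (0, 250, 500, 750, 1000):
        print(step, plan_at(step, total_steps=1000))
```

Because teacher and student share weights in EPIC, a plan like this would only decide how aggressively each forward pass compresses its visual tokens; the two schedules could equally be used on their own, one for the token dimension and one for the layer dimension.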

Paper Structure

This paper contains 40 sections, 1 theorem, 29 equations, 7 figures, 7 tables.

Key Result

Theorem 1

Under assumptions (S1)–(S3), the total variation of the progressive path is strictly smaller than that of direct training.

Figures (7)

  • Figure 1: Progressive Consistency Distillation vs. Direct Training. Each subplot shows the loss landscape under the corresponding token compression ratio, with the optimum indicated. Our method reaches the objective via progressive learning trajectories, while direct training remains challenging.
  • Figure 2: MMBench accuracy vs. number of visual tokens for various methods. TCD (Ours) and LCD (Ours) achieve competitive accuracy with far fewer tokens, lower FLOPs, and a smaller KV cache than LLaVA-v1.5, highlighting their efficiency.
  • Figure 3: An overview of Progressive Consistency Distillation. (i) Token Consistency Distillation progressively increases token compression ratio over time. (ii) Layer Consistency Distillation shifts token compression from deep to shallow layers, promoting layer-wise consistency during training.
  • Figure 4: Following LLaVA-v1.5's architecture and data, we apply DART for token consistency distillation. "w/o train" denotes vanilla LLaVA. At inference, all methods use $88.9\%$ token compression.
  • Figure 5: All experiments use the model trained following LLaVA-v1.5. FLOPs and latency are measured on POPE. The visual token and latency experiments are repeated three times for reliability.
  • ...and 2 more figures
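
The $88.9\%$ compression ratio quoted in the Figure 4 caption can be converted into a token count. The short sketch below assumes the standard LLaVA-v1.5 setup of 576 visual tokens per image (a 24×24 patch grid), which the captions above do not state; the helper name `retained_tokens` is hypothetical.

```python
def retained_tokens(total_tokens: int, compression_ratio: float) -> int:
    """Number of visual tokens kept after dropping `compression_ratio` of them."""
    if not 0.0 <= compression_ratio <= 1.0:
        raise ValueError("compression_ratio must lie in [0, 1]")
    return round(total_tokens * (1.0 - compression_ratio))


# Assumption: 576 visual tokens per image, as in the usual LLaVA-v1.5 configuration.
print(retained_tokens(576, 0.889))  # prints 64
```

Under that assumption, $88.9\%$ compression leaves 64 of the 576 visual tokens, roughly one in nine.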

Theorems & Definitions (2)

  • Theorem 1: Scalar path gain (proof in the appendix)
  • Proof of Theorem 1