Efficient Multi-modal Large Language Models via Progressive Consistency Distillation
Zichen Wen, Shaobo Wang, Yufa Zhou, Junyuan Zhang, Qintong Zhang, Yifeng Gao, Zhaorun Chen, Bin Wang, Weijia Li, Conghui He, Linfeng Zhang
TL;DR
Multi-modal large language models (MLLMs) incur high computational cost because of dense visual tokens. The authors introduce EPIC, a progressive consistency distillation framework in which a single MLLM acts as both teacher and student with shared weights, using Token Consistency Distillation (TCD) and Layer Consistency Distillation (LCD) to adapt progressively to token compression. By jointly scheduling token-level and layer-wise compression with a gradually increasing teacher-student gap, EPIC achieves competitive accuracy with substantially fewer visual tokens, along with improved efficiency, robustness, and generalization across token compression strategies. The method requires no architectural changes and offers practical benefits for resource-constrained deployment, although the compression level must be balanced to get the best latency-accuracy trade-off. Overall, EPIC advances training-aware token compression by enabling smooth, progressive adaptation along both the token and layer dimensions, yielding MLLMs that remain robust under varying inference budgets.
Abstract
Visual tokens consume substantial computational resources in multi-modal large language models (MLLMs), significantly compromising their efficiency. Recent works have attempted to improve efficiency by compressing visual tokens during training, either through modifications to model components or by introducing additional parameters. However, they often overlook the increased learning difficulty caused by such compression, as the model's parameter space struggles to quickly adapt to the substantial perturbations in the feature space induced by token compression. In this work, we propose to develop Efficient MLLMs via Progressive Consistency Distillation (EPIC), a progressive learning framework. Specifically, by decomposing the feature space perturbations introduced by token compression along the token-wise and layer-wise dimensions, we introduce token consistency distillation and layer consistency distillation, respectively, aiming to reduce the training difficulty by leveraging guidance from a teacher model and following a progressive learning trajectory. Extensive experiments demonstrate the superior effectiveness, robustness, and generalization capabilities of our proposed framework.
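To make the idea of a progressive learning trajectory concrete, the following is a minimal sketch of how a token compression schedule and a consistency loss might fit together. It is not the authors' implementation: the function names (`progressive_ratios`, `kl_divergence`), the linear schedule, the assumption that the teacher sees a milder compression ratio than the student, and the values `max_ratio=0.9` and `max_gap=0.3` are all hypothetical choices made for illustration.

```python
import math


def progressive_ratios(step, total_steps, max_ratio=0.9, max_gap=0.3):
    """Return (teacher_ratio, student_ratio): fraction of visual tokens dropped.

    Hypothetical linear schedule: the student's compression ratio and the
    teacher-student gap both grow from 0 as training progresses.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    student_ratio = max_ratio * progress
    gap = max_gap * progress
    # Assumption: the teacher is compressed less than the student.
    teacher_ratio = max(student_ratio - gap, 0.0)
    return teacher_ratio, student_ratio


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) over one output distribution.

    Used here as a stand-in consistency loss between the two forward passes.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    for step in (0, 250, 500, 1000):
        teacher_ratio, student_ratio = progressive_ratios(step, total_steps=1000)
        print(step, round(teacher_ratio, 3), round(student_ratio, 3))
```

In a real training loop, the same shared-weight model would be run twice per batch, once at each ratio, and the consistency term would be added to the usual task loss. Early in training the two ratios nearly coincide, so the student's target is easy to match; the task becomes harder only as the schedule advances.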
