FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance
Haicheng Wang, Zhemeng Yu, Gabriele Spadaro, Chen Ju, Victor Quétu, Shuai Xiao, Enzo Tartaglione
TL;DR
FOLDER addresses the bottleneck of long visual token sequences in Multi-modal Large Language Models (MLLMs) with a plug-and-play token-reduction module that concentrates aggressive token merging in the final vision blocks. The paper analyzes information loss in terms of energy preservation, propagation, and aggregation, and uses a FOLD mechanism based on bipartite soft matching with simple averaging to preserve semantic content. Empirically, FOLDER delivers inference speedups of up to ~2.4× and training acceleration of ≈1.5×, with notable memory savings, while often matching or improving task performance across image, video, and multi-vision-tower MLLMs. This makes MLLMs more practical for real-time deployment and scalable training; the approach can also act as a regularizer that reduces noise from long visual token sequences.
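To make the idea of bipartite soft matching with simple averaging concrete, here is a minimal NumPy sketch of a single merge step. It is not the authors' implementation: the function name `bipartite_merge`, the even/odd split into two token sets, the use of cosine similarity, and the output ordering are all assumptions chosen for illustration, in the style of generic token-merging methods.

```python
import numpy as np


def bipartite_merge(tokens, r):
    """Merge up to r tokens with one bipartite soft matching step (illustrative).

    tokens: array of shape (n, d). Returns an array of shape (n - r', d),
    where r' = min(r, number of tokens in set A).
    Assumptions (not from the paper): even/odd split, cosine similarity.
    """
    tokens = np.asarray(tokens, dtype=np.float64)
    a, b = tokens[0::2], tokens[1::2]  # hypothetical split into sets A and B
    r = min(r, a.shape[0])
    if r <= 0 or b.shape[0] == 0:
        return tokens.copy()

    # Cosine similarity between every A token and every B token.
    a_n = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b_n = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    sim = a_n @ b_n.T

    # Each A token proposes its most similar B token.
    best_b = sim.argmax(axis=1)
    best_sim = sim.max(axis=1)

    # The r most similar A tokens are merged; the rest are kept as-is.
    order = np.argsort(-best_sim, kind="stable")
    merged_idx = order[:r]
    kept_idx = np.sort(order[r:])

    # Simple averaging: each B token becomes the mean of itself and
    # all A tokens merged into it.
    sums = b.copy()
    counts = np.ones(b.shape[0])
    np.add.at(sums, best_b[merged_idx], a[merged_idx])
    np.add.at(counts, best_b[merged_idx], 1.0)
    merged_b = sums / counts[:, None]

    # Output order is unmerged A tokens followed by B tokens.
    return np.concatenate([a[kept_idx], merged_b], axis=0)
```

Calling `bipartite_merge(x, r)` on an `(n, d)` array returns an array with `r` fewer rows (when `r` does not exceed the size of set A). Because one step can remove at most about half the tokens, a larger reduction would require repeating the step; how FOLDER schedules its merges across the final vision blocks is not shown here.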
Abstract
Recently, Multi-modal Large Language Models (MLLMs) have shown remarkable effectiveness on multi-modal tasks owing to their ability to generate and understand cross-modal data. However, processing long sequences of visual tokens extracted from visual backbones poses a challenge for deployment in real-time applications. To address this issue, we introduce FOLDER, a simple yet effective plug-and-play module designed to reduce the length of the visual token sequence, mitigating both computational and memory demands during training and inference. Through a comprehensive study of the token reduction process, we analyze the information loss introduced by different reduction strategies and develop FOLDER to preserve key information while removing visual redundancy. We showcase the effectiveness of FOLDER by integrating it into the visual backbone of several MLLMs, significantly accelerating the inference phase. Furthermore, we evaluate its utility as a training accelerator or even a performance booster for MLLMs. In both contexts, FOLDER achieves comparable or even better performance than the original models, while dramatically reducing complexity by removing up to 70% of visual tokens.
