FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance

Haicheng Wang, Zhemeng Yu, Gabriele Spadaro, Chen Ju, Victor Quétu, Shuai Xiao, Enzo Tartaglione

TL;DR

FOLDER addresses the bottleneck of long visual token sequences in Multi-modal Large Language Models (MLLMs) with a plug-and-play token-reduction module that concentrates aggressive token merging in the final blocks of the vision backbone. The paper analyzes information loss during token reduction from three angles (energy preservation, propagation, and aggregation) and, based on that analysis, uses a FOLD mechanism built on bipartite soft matching with simple averaging to preserve semantic content. Empirically, FOLDER delivers inference speedups of up to ~2.4× and training acceleration of ≈1.5×, with notable memory savings, while often matching or improving task performance across image, video, and multi-vision-tower MLLMs. This makes MLLMs more practical for real-time deployment and scalable training; the approach can also act as a regularizer that reduces noise from long visual token sequences.

Abstract

Recently, Multi-modal Large Language Models (MLLMs) have shown remarkable effectiveness on multi-modal tasks owing to their ability to generate and understand cross-modal data. However, processing the long sequences of visual tokens extracted by visual backbones poses a challenge for deployment in real-time applications. To address this issue, we introduce FOLDER, a simple yet effective plug-and-play module designed to reduce the length of the visual token sequence, mitigating both computational and memory demands during training and inference. Through a comprehensive study of the token reduction process, we analyze the information loss introduced by different reduction strategies and develop FOLDER to preserve key information while removing visual redundancy. We showcase the effectiveness of FOLDER by integrating it into the visual backbone of several MLLMs, significantly accelerating the inference phase. Furthermore, we evaluate its utility as a training accelerator and even as a performance booster for MLLMs. In both contexts, FOLDER achieves comparable or even better performance than the original models, while dramatically reducing complexity by removing up to 70% of visual tokens.
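
As a rough illustration of what removing 70% of visual tokens can mean for compute (this calculation is not taken from the paper), assume a transformer layer that processes $N$ visual tokens, and let $\rho$ be the fraction removed. The symbols $C_{\text{attn}}$ and $C_{\text{lin}}$ are introduced here only for illustration: under the usual cost model, the pairwise attention term grows quadratically with the token count, while the projection and MLP terms grow linearly.

$$
\frac{C_{\text{attn}}\big((1-\rho)N\big)}{C_{\text{attn}}(N)} = (1-\rho)^2,
\qquad
\frac{C_{\text{lin}}\big((1-\rho)N\big)}{C_{\text{lin}}(N)} = 1-\rho .
$$

For $\rho = 0.7$ these ratios are $0.09$ and $0.3$ respectively. They apply only to the visual-token part of the computation; end-to-end speedups can be smaller because text tokens and other fixed costs are unaffected.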

Paper Structure (22 sections, 9 equations, 9 figures, 12 tables, 1 algorithm)

Figures (9)

  • Figure 1: FOLDER as Accelerator & Booster. As a plug-and-play module, FOLDER can be used in both training and inference, with considerable acceleration and even performance boost.
  • Figure 2: Minimum Number of Tokens with Energy $t_{E}$ Across Blocks. We evaluate three types of $t_{E}$ for every block.
  • Figure 3: EMD Distance Between Reduced and Original Output Distributions under 3 Reduction Ratios. We compare the EMD obtained when token reduction is applied at different blocks.
  • Figure 4: Comparison of EMD and Accuracy for Aggregation Methods. We compare "direct dropping", "average merging" and "weighted merging" under different reduction ratios.
  • Figure 5: Pipeline of FOLDER. As a plug-and-play module, FOLDER is integrated into the final blocks of the vision backbone (the last two here). To handle reduction overflow, FOLDER automatically executes another FOLD operation when the expected reduction is more than half of the tokens. The last FOLD, which no longer overflows, merges tokens according to the remaining reduction count.
  • ...and 4 more figures
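
The Figure 5 caption and the TL;DR together describe a merge step based on bipartite soft matching with averaging, repeated whenever the requested reduction exceeds half of the tokens. The sketch below is one possible reading of that description, not the authors' implementation. The function names `fold_once` and `fold_reduce`, the alternating split into two sets, the cosine-similarity matching, and the per-token counts used to keep each merged token equal to the mean of the original tokens it absorbed are all assumptions made for illustration.

```python
import numpy as np


def fold_once(tokens, sizes, r):
    """One bipartite soft matching step (hypothetical helper).

    Splits tokens alternately into sets A and B, matches every A token to its
    most similar B token, and merges the r best-matched A tokens into B.
    A single step can remove at most half of the tokens.
    """
    n = tokens.shape[0]
    r = min(r, n // 2)
    if r <= 0:
        return tokens, sizes
    a_idx, b_idx = np.arange(0, n, 2), np.arange(1, n, 2)

    # Cosine similarity between every A token and every B token (assumed metric).
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-12)
    sim = unit[a_idx] @ unit[b_idx].T
    best_b = sim.argmax(axis=1)   # best partner in B for each A token
    best_sim = sim.max(axis=1)

    # The r most similar A tokens are merged; the rest are kept unchanged.
    order = np.argsort(-best_sim, kind="stable")
    merged_a, kept_a = order[:r], np.sort(order[r:])

    # Accumulate count-weighted sums so that dividing by the new counts gives
    # the plain average of all original tokens in each merged group.
    sums = tokens * sizes[:, None]
    b_sums, b_sizes = sums[b_idx].copy(), sizes[b_idx].copy()
    np.add.at(b_sums, best_b[merged_a], sums[a_idx[merged_a]])
    np.add.at(b_sizes, best_b[merged_a], sizes[a_idx[merged_a]])

    # Note: kept A tokens are placed before B tokens, so the original
    # spatial order is not preserved in this sketch.
    out_tokens = np.concatenate([tokens[a_idx[kept_a]], b_sums / b_sizes[:, None]])
    out_sizes = np.concatenate([sizes[a_idx[kept_a]], b_sizes])
    return out_tokens, out_sizes


def fold_reduce(tokens, num_remove):
    """Remove num_remove tokens, repeating fold_once when one step is not enough.

    This mirrors the "reduction overflow" idea: if more than half of the
    current tokens must go, fold as much as possible and then fold again
    with the remaining count.
    """
    tokens = np.asarray(tokens, dtype=float)
    sizes = np.ones(tokens.shape[0])
    remaining = min(num_remove, tokens.shape[0] - 1)  # always keep one token
    while remaining > 0:
        step = min(remaining, tokens.shape[0] // 2)
        tokens, sizes = fold_once(tokens, sizes, step)
        remaining -= step
    return tokens, sizes
```

For example, calling `fold_reduce(x, 7)` on an array `x` of shape `(10, 4)` runs two folds in this sketch: the first removes 5 tokens (the maximum for 10 tokens), and the second removes the remaining 2, leaving 3 tokens. Tracking counts is a design choice made here so that repeated folds do not over-weight tokens merged late; the paper's own averaging rule may differ.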