ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs

Chaoyu Li, Yogesh Kulkarni, Pooyan Fazli

TL;DR

ReGATE tackles the heavy training cost of multimodal LLMs by introducing a reference-guided, adaptive token elision mechanism that gates computation during training without architectural changes. It uses a frozen text-only teacher to generate per-token reference losses and combines this with an EMA-based measure of the student’s difficulty to produce a per-token importance score, selecting the most informative tokens in a dynamic, cycle-based sparsity schedule. Across three representative MLLMs for image and video tasks, ReGATE achieves substantial token reductions and training-time speedups while maintaining or improving accuracy, demonstrating robustness across architectures and training regimes. Ablation studies confirm the importance of a balanced lambda and a capacity-aligned teacher, supporting the method’s practical applicability and generality.

Abstract

The computational cost of training multimodal large language models (MLLMs) grows rapidly with the number of processed tokens. Existing efficiency methods mainly target inference via token reduction or merging, offering limited benefits during training. We introduce ReGATE (Reference-Guided Adaptive Token Elision), an adaptive token pruning method for accelerating MLLM training. ReGATE adopts a teacher-student framework, in which a frozen teacher LLM provides per-token guidance losses that are fused with an exponential moving average of the student's difficulty estimates. This adaptive scoring mechanism dynamically selects informative tokens while skipping redundant ones in the forward pass, substantially reducing computation without altering the model architecture. Across three representative MLLMs, ReGATE matches the peak accuracy of standard training on MVBench up to 2$\times$ faster, using only 38% of the tokens. With extended training, it even surpasses the baseline across multiple multimodal benchmarks, cutting total token usage by over 41%. Code and models will be released publicly.
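The scoring step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. The source says only that teacher reference losses are fused with an exponential moving average of the student's difficulty, and that a balanced lambda matters. The convex combination weighted by `lam`, the EMA `decay`, the fixed `keep_ratio`, and all function names are assumptions, and the cycle-based sparsity schedule is not modeled.

```python
from typing import List


def update_ema(prev_ema: List[float], student_loss: List[float],
               decay: float = 0.9) -> List[float]:
    """Exponential moving average of the student's per-token loss.

    `decay` is a hypothetical value; the source does not give one.
    """
    return [decay * e + (1.0 - decay) * s
            for e, s in zip(prev_ema, student_loss)]


def importance_scores(ref_loss: List[float], ema_loss: List[float],
                      lam: float = 0.5) -> List[float]:
    """Fuse teacher reference loss with the student's EMA difficulty.

    Assumption: a convex combination weighted by `lam`. The exact
    fusion rule used by ReGATE is not stated in this section.
    """
    return [lam * r + (1.0 - lam) * e for r, e in zip(ref_loss, ema_loss)]


def top_k_mask(scores: List[float], keep_ratio: float) -> List[bool]:
    """Binary mask that keeps the highest-scoring fraction of tokens."""
    k = max(1, int(keep_ratio * len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    return [i in keep for i in range(len(scores))]


# Toy example with four text tokens.
ref_loss = [0.2, 2.5, 0.1, 1.8]        # from the frozen teacher
ema = update_ema([1.0, 1.0, 1.0, 1.0], [0.5, 3.0, 0.2, 0.4])
scores = importance_scores(ref_loss, ema, lam=0.5)
mask = top_k_mask(scores, keep_ratio=0.5)
print(mask)
```

In this toy run the second and fourth tokens are kept, since they combine a high reference loss with non-trivial student difficulty. In the method as described, only the kept tokens would go through the student's computation.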

Paper Structure

This paper contains 21 sections, 4 equations, 5 figures, 9 tables, 1 algorithm.

Figures (5)

  • Figure 1: In training MLLMs, ReGATE identifies important textual tokens (light green) and selectively propagates them, while skipping unimportant ones (red).
  • Figure 2: Overview of ReGATE. The framework operates in two interconnected stages. (1) Reference Loss Generation (Left): A frozen, text-only teacher LLM processes the input text (with padding tokens) and computes a per-token reference loss (ref_loss), which measures how difficult each token is to predict from text alone. Higher loss values suggest the token likely requires visual grounding (e.g., "white", "red stripe"). (2) Student Training (Right): The ref_loss is combined with the student model’s historical learning difficulty to produce a unified importance score. This score is used to create a binary mask that selects the most informative tokens. During training, the student LLM receives the full multimodal input but only performs computation (e.g., self-attention and feed-forward operations) on the selected tokens, while skipping the rest.
  • Figure 3: Zero-shot accuracy on MVBench during fine-tuning. ReGATE (red) consistently outperforms standard fine-tuning (orange) at the same token count. It reaches the baseline’s peak accuracy roughly twice as fast while using only 38% of the tokens on average, and surpasses the baseline with 41% fewer tokens.
  • Figure 4: Attention maps from standard fine-tuning and ReGATE on video QA tasks. ReGATE focuses on contextually relevant regions (e.g., hands and manipulated objects), whereas standard fine-tuning spreads attention across the background.
  • Figure 5: Qualitative examples illustrating the effectiveness of the reference loss signal. For two video Q&A pairs, we show the per-token reference loss computed by a text-only teacher model (Mistral-7B). Tokens colored in red have the highest losses and represent the top 50% most difficult tokens to predict from text alone. These are precisely the tokens that ReGATE prioritizes for computation.
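
The reference loss in Figures 2 and 5 is a per-token prediction loss from a text-only teacher. A minimal sketch of how such a loss could be computed from teacher logits, and how the top 50% hardest tokens could then be flagged, is given below. It assumes the loss is the standard cross-entropy of the target token under the teacher's softmax. The function names and the toy logits are hypothetical, and no real teacher model is called.

```python
import math
from typing import List


def per_token_loss(logits: List[List[float]], targets: List[int]) -> List[float]:
    """Cross-entropy of each target token under a softmax over `logits`.

    Each row of `logits` holds the teacher's scores over the vocabulary
    at one position; `targets` holds the true token id at that position.
    """
    losses = []
    for row, target in zip(logits, targets):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(log_z - row[target])
    return losses


def hardest_half(losses: List[float]) -> List[bool]:
    """Flag the 50% of tokens with the highest loss (ties broken by position)."""
    k = max(1, len(losses) // 2)
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    keep = set(order[:k])
    return [i in keep for i in range(len(losses))]


# Toy example: four positions, vocabulary of size two.
logits = [[2.0, 0.0], [0.0, 0.0], [0.0, 3.0], [1.0, 0.0]]
targets = [0, 1, 0, 0]
losses = per_token_loss(logits, targets)
flags = hardest_half(losses)
print([round(l, 3) for l in losses], flags)
```

Positions where the teacher is confidently wrong or undecided get the largest losses. Under the interpretation in Figure 2, these are the tokens most likely to need visual grounding.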