AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding

Haozhe Qi, Kevin Qu, Mahdi Rad, Rui Wang, Alexander Mathis, Marc Pollefeys

Abstract

Long video understanding remains challenging for Multi-modal Large Language Models (MLLMs) due to high memory costs and context-length limits. Prior approaches mitigate this by scoring and selecting frames/tokens within short clips, but they lack a principled mechanism to (i) compare relevance across distant video clips and (ii) stop processing once sufficient evidence has been gathered. We propose AdaptToken, a training-free framework that turns an MLLM's self-uncertainty into a global control signal for long-video token selection. AdaptToken splits a video into groups, extracts cross-modal attention to rank tokens within each group, and uses the model's response entropy to estimate each group's prompt relevance. This entropy signal enables a global token budget allocation across groups and further supports early stopping (AdaptToken-Lite), skipping the remaining groups when the model becomes sufficiently certain. Across four long-video benchmarks (VideoMME, LongVideoBench, LVBench, and MLVU) and multiple base MLLMs (7B-72B), AdaptToken consistently improves accuracy (e.g., +6.7 on average over Qwen2.5-VL 7B) and continues to benefit from extremely long inputs (up to 10K frames), while AdaptToken-Lite reduces inference time by about half with comparable performance. Project page: https://haozheqi.github.io/adapt-token
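
The core signal in the abstract, response entropy, is easy to state concretely. The sketch below is a minimal illustration, not the paper's reference implementation: the function name `response_entropy` and the `(num_answer_tokens, vocab_size)` logits layout are assumptions made here for clarity.

```python
import torch

def response_entropy(answer_logits: torch.Tensor) -> float:
    """Shannon entropy of the model's answer distribution, averaged
    over the generated answer tokens. A low value indicates a
    confident response, which AdaptToken reads as high prompt
    relevance for the frame group under consideration.

    answer_logits: (num_answer_tokens, vocab_size) pre-softmax scores
    from a trial forward pass on one frame group. (Shape and name
    are illustrative assumptions, not the paper's interface.)
    """
    log_p = torch.log_softmax(answer_logits, dim=-1)
    per_token = -(log_p.exp() * log_p).sum(dim=-1)  # entropy per token
    return per_token.mean().item()
```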

Figures (6)

  • Figure 1: We propose AdaptToken, a flexible and efficient token selection strategy for long video understanding. Compared with state-of-the-art frame/token selection methods on several challenging long-video benchmarks, AdaptToken consistently delivers improved performance.
  • Figure 2: Overall pipeline of AdaptToken. AdaptToken processes long videos by dividing them into frame groups and selecting informative tokens within each group based on group relevance estimated from response entropy. It progressively gathers evidence across groups and stops processing once sufficient information has been collected.
  • Figure 3: Needle-in-a-Haystack experiments based on InternVL2.5 8B. Response entropy distributions for correct vs. incorrect predictions under varying numbers of input frames, with and without needles.
  • Figure 4: Real-data entropy experiments based on InternVL2.5 8B. Response-entropy distributions for correct vs. incorrect predictions on real-world benchmarks (VideoMME and MLVU).
  • Figure 5: Visualization of AdaptToken token selection. Two frame groups are presented side by side. For each group, we first estimate intra-group token relevance via cross-modal attention (heatmaps in the second row), and group-level relevance via response entropy, which measures the model's answer confidence. Based on these signals, we perform global-aware token selection, adaptively allocating a larger token budget to groups that are more relevant to the text prompt (colored masks in the third row). The resulting token set is compact yet information-dense, improving both accuracy and inference efficiency for long-video understanding. A code sketch of this allocation and the early-stopping rule appears after this list.
  • ...and 1 more figure
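
Figures 2 and 5 together describe how the entropy signal drives global budget allocation and early stopping. The sketch below shows one plausible reading, assuming the `response_entropy` helper above: groups are processed in order, processing stops once a group's response entropy falls below a threshold, and the global token budget is split across the kept groups in proportion to `exp(-entropy)`. The triple-based interface, the `exp(-entropy)` weighting, and the `stop_entropy` threshold are illustrative assumptions; the paper's exact weighting and stopping rule may differ.

```python
import math
import torch

def allocate_and_select(groups, total_budget=4096, stop_entropy=0.5):
    """Sketch of global-aware token selection with early stopping.

    groups: iterable of (tokens, attn, entropy) triples per frame group:
      tokens:  (n, d) visual tokens for the group,
      attn:    (n,)  cross-modal attention score per token,
      entropy: float response entropy for the group (lower = more
               confident, hence more prompt-relevant).
    """
    kept = []
    for tokens, attn, entropy in groups:
        kept.append((tokens, attn, entropy))
        if entropy < stop_entropy:   # AdaptToken-Lite: enough evidence
            break
    # Map entropy to a relevance weight and normalize across groups,
    # so confident (low-entropy) groups receive a larger budget.
    weights = torch.tensor([math.exp(-e) for _, _, e in kept])
    weights = weights / weights.sum()
    selected = []
    for (tokens, attn, _), w in zip(kept, weights):
        k = max(1, min(int(w.item() * total_budget), attn.numel()))
        idx = attn.topk(k).indices   # top-k tokens within the group
        selected.append(tokens[idx])
    return torch.cat(selected)       # compact, information-dense set
```

In this sketch, setting `stop_entropy` to 0 disables early stopping, recovering plain AdaptToken behavior (budget reallocation over all groups) rather than AdaptToken-Lite.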