ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning

Huanxuan Liao, Zhongtao Jiang, Yupu Hao, Yuqiao Tan, Shizhu He, Ben Wang, Jun Zhao, Kun Xu, Kang Liu

Abstract

Multimodal Large Language Models (MLLMs) achieve stronger visual understanding by scaling input fidelity, yet the resulting growth in visual tokens makes it prohibitive to sustain high spatial resolution and long temporal context at the same time. We argue that the bottleneck lies not in how post-encoding representations are compressed but in the volume of pixels the encoder receives, and address it with ResAdapt, an input-side adaptation framework that learns how much visual budget each frame should receive before encoding. ResAdapt couples a lightweight Allocator with an unchanged MLLM backbone, so the backbone retains its native visual-token interface while receiving an operator-transformed input. We formulate allocation as a contextual bandit and train the Allocator with Cost-Aware Policy Optimization (CAPO), which converts sparse rollout feedback into a stable accuracy-cost learning signal. Across budget-controlled video QA, temporal grounding, and image reasoning tasks, ResAdapt improves low-budget operating points and often lies on or near the efficiency-accuracy frontier, with the clearest gains on reasoning-intensive benchmarks under aggressive compression. Notably, ResAdapt supports up to 16x more frames at the same visual budget while delivering a performance gain of over 15%. Code is available at https://github.com/Xnhyacinth/ResAdapt.
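
The abstract frames per-frame budget allocation as a contextual bandit trained with CAPO, which turns sparse rollout feedback into an accuracy-cost learning signal via group-relative advantages and a dynamic cost pivot (see Figure 2b). The snippet below is a minimal sketch of that shaping step only, assuming a scalar correctness reward and a visual-cost ratio per rollout; the penalty weight `lam` and the mean-cost default for the pivot are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def capo_advantages(rewards, costs, cost_pivot=None, lam=0.1, eps=1e-8):
    """Sketch of a cost-aware, group-relative advantage (illustrative, not CAPO verbatim).

    rewards: per-rollout task rewards for one sampled group (e.g. 0/1 answer correctness).
    costs:   per-rollout visual-token cost (e.g. retention ratio in [0, 1]).
    The cost pivot defaults to the group's mean cost as a stand-in for a dynamic
    reference point: rollouts cheaper than the pivot are pushed up, more expensive
    ones are pushed down, with strength controlled by `lam`.
    """
    rewards = np.asarray(rewards, dtype=float)
    costs = np.asarray(costs, dtype=float)
    pivot = costs.mean() if cost_pivot is None else cost_pivot
    shaped = rewards - lam * (costs - pivot)                  # accuracy-cost trade-off
    return (shaped - shaped.mean()) / (shaped.std() + eps)   # group-relative normalization

# Example: of the two correct rollouts, the cheaper one receives the larger advantage.
print(capo_advantages(rewards=[1.0, 1.0, 0.0, 0.0], costs=[0.10, 0.30, 0.05, 0.25]))
```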

Paper Structure

This paper contains 39 sections, 41 equations, 20 figures, 7 tables, and 1 algorithm.

Figures (20)

  • Figure 1: Input-side Adaptation improves the visual-token efficiency frontier. (a) Three efficiency paradigms for video reasoning. Model-side methods compress tokens after encoding; output-side methods iteratively retrieve or zoom; ResAdapt reallocates per-frame visual budget before encoding, preserving the backbone's native token interface and compatibility with optimized inference engines. (b) Qwen2.5-VL-7B results with 32 frames at ${\sim}$10% visual retention, where ResAdapt lies on or near the Pareto frontier and shows its largest gain on the reasoning-heavy benchmark.
  • Figure 2: ResAdapt framework. (a) At inference, a lightweight Allocator $\pi_\theta$ maps coarse visual features and the query to latent actions $a_t\sim\mathrm{Beta}(\alpha_t,\beta_t)$, which parameterize per-frame input allocations. In the resize instantiation used in our experiments, these allocations are realized as scales $s_t\in[s_{\min},s_{\max}]$, and the resized frames are processed by the MLLM in a single call. (b) During training, CAPO reshapes group-relative advantages with a dynamic cost pivot $\tau_{\text{dyn}}$, while temporal-similarity regularization suppresses redundant high-budget allocation on adjacent similar frames. (A minimal sketch of this inference path follows after the figure list.)
  • Figure 3: Efficiency-accuracy trade-offs and temporal reallocation. (a, b) VideoMMMU and VideoMME versus visual-token retention ratio $R$. ResAdapt is on or near the Pareto frontier, with the clearest advantage on reasoning-heavy settings at low retention. (c) Relative gain from trading spatial resolution for temporal coverage under a fixed 8-frame-equivalent budget.
  • Figure 4: Emergent active perception. Per-frame scale $s_t$ over frame index for six VideoMME videos, grouped by intra-video scale diversity $\sigma$. High-diversity videos show localized scale spikes on scene changes, text overlays, and rapid motion; low-diversity videos remain near-uniform.
  • Figure 5: Global allocation statistics on VideoMME. (a) Aggregate predicted scale by frame position. (b) Case$\times$frame heatmap for the first 200 videos. High-scale allocation appears as localized bursts rather than a fixed positional pattern.
  • ...and 15 more figures
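
Figure 2(a) describes the inference path: a lightweight Allocator maps coarse frame features and the query to per-frame latent actions $a_t\sim\mathrm{Beta}(\alpha_t,\beta_t)$, which the resize instantiation converts to scales $s_t\in[s_{\min},s_{\max}]$ before the frames reach the unchanged MLLM. The following is a minimal PyTorch sketch of that path; the feature dimensions, two-layer MLP head, 28-pixel grid snapping, and default scale range are assumptions for illustration, not the released architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyAllocator(nn.Module):
    """Illustrative allocator head: per-frame Beta parameters from fused coarse
    frame features and a pooled query embedding (dimensions and the two-layer
    MLP are assumptions, not the paper's architecture)."""

    def __init__(self, feat_dim=256, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.GELU(), nn.Linear(hidden, 2)
        )

    def forward(self, frame_feats, query_feat):
        # frame_feats: (T, D) coarse visual features; query_feat: (D,) pooled query embedding.
        fused = torch.cat([frame_feats, query_feat.expand_as(frame_feats)], dim=-1)
        ab = F.softplus(self.mlp(fused)) + 1.0  # alpha_t, beta_t > 1 -> unimodal Beta
        return torch.distributions.Beta(ab[..., 0], ab[..., 1])

def allocate_scales(dist, s_min=0.25, s_max=1.0, sample=False):
    # Latent action a_t in (0, 1) mapped affinely to a resize scale s_t in [s_min, s_max];
    # the Beta mean gives a deterministic allocation, sampling gives a stochastic one.
    a = dist.sample() if sample else dist.mean
    return s_min + (s_max - s_min) * a

def resize_frames(frames, scales, patch=28):
    # frames: list of (3, H, W) tensors; each frame is rescaled by its own s_t and
    # snapped to a patch grid (28 px here is an assumed vision-encoder granularity).
    out = []
    for f, s in zip(frames, scales.tolist()):
        h = max(patch, int(round(f.shape[1] * s / patch)) * patch)
        w = max(patch, int(round(f.shape[2] * s / patch)) * patch)
        out.append(F.interpolate(f[None], size=(h, w), mode="bilinear", align_corners=False)[0])
    return out
```

In a full pipeline the resized frames would then be packed and passed to the backbone in a single call, as described for the resize instantiation in Figure 2(a); the backbone itself is left unchanged.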