Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models
Siyou Li, Huanan Wu, Juexi Shao, Yinghao Ma, Yujian Gan, Yihao Luo, Yuwei Wang, Dong Nie, Lu Wang, Wengqing Wu, Le Zhang, Massimo Poesio, Juntao Yu
TL;DR
This work tackles the challenge of long-video understanding in multimodal language models by introducing QTSplus, a query-aware token selector that dynamically gates visual tokens based on a textual query and video statistics. The method scores tokens via cross-attention, predicts an instance-specific retention budget $\rho$, and applies a differentiable gate during training; a lightweight re-encoder then preserves temporal order. Together these steps significantly cut the KV-cache and attention burden. Empirically, integrating QTSplus with Qwen2.5-VL reduces visual tokens by up to $89\%$ and end-to-end latency by up to $28\%$ on long videos, while achieving near-parity or improvements on multiple long-video benchmarks, especially on temporally focused tasks. The results demonstrate that adaptive, relevance-aware tokenization is a practical path to scaling MLLMs to hour-long inputs under realistic compute and memory constraints, with potential extensions to streaming scenarios and multi-query/multi-camera settings.
Abstract
Despite recent advances in the video understanding ability of multimodal large language models (MLLMs), long-video understanding remains a challenge. One of the main issues is that the number of vision tokens grows linearly with video length, which causes an explosion in attention cost, memory, and latency. To address this challenge, we present Query-aware Token Selector (\textbf{QTSplus}), a lightweight yet powerful visual token selection module that serves as an information gate between the vision encoder and the LLM. Given a text query and video tokens, QTSplus dynamically selects the most important visual evidence for the input text query by (i) scoring visual tokens via cross-attention, (ii) \emph{predicting} an instance-specific retention budget based on the complexity of the query, and (iii) \emph{selecting} Top-$n$ tokens with a differentiable straight-through estimator during training and a hard gate at inference. Furthermore, a small re-encoder preserves temporal order using absolute time information, enabling second-level localization while maintaining global coverage. Integrated into Qwen2.5-VL, QTSplus compresses the vision stream by up to \textbf{89\%} and reduces end-to-end latency by \textbf{28\%} on long videos. Evaluation on eight long-video understanding benchmarks shows near-parity accuracy overall compared with the original Qwen models, and QTSplus outperforms the original model by \textbf{+20.5} and \textbf{+5.6} points on TempCompass direction and order accuracy, respectively. These results show that QTSplus is an effective, general mechanism for scaling MLLMs to real-world long-video scenarios while preserving task-relevant evidence.
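The inference-time path described in steps (i) and (iii) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `score_tokens` and `select_tokens` are hypothetical, the relevance score is assumed to be the maximum softmax-normalized scaled dot product over query tokens, and the retention budget $\rho$ is passed in as a plain number rather than predicted by a learned module. The straight-through estimator used during training and the re-encoder are not shown.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def score_tokens(query_vecs, visual_tokens):
    """Assumed scoring rule: for each visual token, take the largest
    cross-attention weight it receives from any query token."""
    scale = 1.0 / math.sqrt(len(visual_tokens[0]))
    scores = [0.0] * len(visual_tokens)
    for q in query_vecs:
        logits = [scale * sum(a * b for a, b in zip(q, v)) for v in visual_tokens]
        attn = softmax(logits)
        scores = [max(s, a) for s, a in zip(scores, attn)]
    return scores


def select_tokens(scores, rho):
    """Hard gate: keep the Top-n tokens, n = ceil(rho * N), and return
    their indices in ascending (temporal) order."""
    n = max(1, math.ceil(rho * len(scores)))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]
    return sorted(top)


# Toy example: one 2-d query token and six 2-d visual tokens.
query = [[1.0, 0.0]]
video = [[0.0, 1.0], [2.0, 0.0], [0.0, 0.0], [3.0, 0.0], [0.0, 2.0], [1.0, 0.0]]
kept = select_tokens(score_tokens(query, video), rho=0.3)
print(kept)  # [1, 3]
```

Returning the kept indices in ascending order mirrors the abstract's emphasis on preserving temporal order after selection; how the actual re-encoder uses absolute time information is not specified here.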
