QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension
Yongdong Luo, Wang Chen, Xiawu Zheng, Weizhong Huang, Shukang Yin, Haojia Lin, Chaoyou Fu, Jinfa Huang, Jiayi Ji, Jiebo Luo, Rongrong Ji
TL;DR
QuoTA tackles input-level visual redundancy in long video understanding by introducing an ante-hoc, training-free token assignment scheme that aligns token usage with query relevance. It uses CoT-driven query decoupling to produce a task-focused object list, scores each frame against that list with a lightweight scoring LVLM, and then applies dynamic token allocation and optional frame sampling. The method is plug-and-play for LVLMs and achieves an average improvement of $3.2\%$ across six benchmarks under a fixed visual token budget of $N_t = 12{,}544$; it also sets state-of-the-art results on multiple datasets, demonstrating practical gains in long-video tasks with reduced redundancy. This approach offers a scalable, training-free pathway to improve cross-modal reasoning by focusing computation on query-relevant frames and preserving salient semantic content.
Abstract
Recent advances in long video understanding typically mitigate visual redundancy through visual token pruning based on attention distribution. However, while existing methods employ post-hoc low-response token pruning in decoder layers, they overlook the input-level semantic correlation between visual tokens and the instruction (query). In this paper, we propose QuoTA, an ante-hoc, training-free module that extends existing large video-language models (LVLMs) with visual token assignment based on query-oriented, frame-level importance assessment. Query-oriented token selection is crucial because it aligns visual processing with task-specific requirements, optimizing token budget utilization while preserving semantically relevant content. Specifically, (i) QuoTA strategically allocates frame-level importance scores based on query relevance, enabling one-time visual token assignment before cross-modal interactions in decoder layers; (ii) we decouple the query through Chain-of-Thought reasoning to facilitate more precise LVLM-based frame importance scoring; and (iii) QuoTA offers plug-and-play functionality that extends to existing LVLMs. Extensive experimental results demonstrate that implementing QuoTA with LLaVA-Video-7B yields an average performance improvement of 3.2% across six benchmarks (including Video-MME and MLVU) while operating within the same visual token budget as the baseline. Code is open-sourced at https://github.com/MAC-AutoML/QuoTA.
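To make the idea of query-oriented token assignment under a fixed budget concrete, the following is a minimal sketch, not the released implementation. It assumes per-frame relevance scores are already available (for example, from a scoring LVLM) and splits the budget in proportion to a softmax over those scores. The function name `allocate_tokens`, the `temperature` parameter, the softmax normalization, and the largest-remainder rounding are all illustrative assumptions; the abstract does not specify the exact allocation rule.

```python
import math


def allocate_tokens(scores, budget, temperature=1.0):
    """Split a fixed visual-token budget across frames.

    Hypothetical helper: each frame's share is proportional to
    softmax(score / temperature). This rule is an assumption made for
    illustration, not the allocation rule confirmed by the paper.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    if budget < 0 or temperature <= 0:
        raise ValueError("budget must be >= 0 and temperature > 0")

    # Numerically stable softmax weights over frame relevance scores.
    top = max(scores)
    weights = [math.exp((s - top) / temperature) for s in scores]
    total = sum(weights)

    # Ideal (fractional) share of the budget for each frame.
    shares = [budget * w / total for w in weights]

    # Round down, then hand leftover tokens to the frames with the
    # largest fractional remainders so the sum equals the budget exactly.
    alloc = [int(x) for x in shares]
    leftover = budget - sum(alloc)
    by_remainder = sorted(
        range(len(scores)), key=lambda i: shares[i] - alloc[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        alloc[i] += 1
    return alloc


if __name__ == "__main__":
    # Four frames with made-up relevance scores; 12,544 is the token
    # budget quoted in the TL;DR.
    print(allocate_tokens([0.1, 0.9, 0.5, 0.2], 12544))
```

In this sketch, frames with higher query relevance receive more tokens while the total stays equal to the budget, which mirrors the "identical visual token budget" constraint described above. A lower `temperature` would concentrate tokens on the top-scoring frames; a higher one would approach uniform allocation.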
