QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension

Yongdong Luo, Wang Chen, Xiawu Zheng, Weizhong Huang, Shukang Yin, Haojia Lin, Chaoyou Fu, Jinfa Huang, Jiayi Ji, Jiebo Luo, Rongrong Ji

TL;DR

QuoTA tackles input-level visual redundancy in long video understanding with an ante-hoc, training-free token assignment scheme that aligns token usage with query relevance. It uses CoT-driven query decoupling to turn the query into a task-focused object list, scores each frame against that list with a lightweight scoring LVLM, and then allocates visual tokens dynamically according to the scores, with optional frame sampling. The method is plug-and-play for existing LVLMs and achieves an average improvement of $3.2\%$ across six benchmarks under a fixed token budget of $N_t = 12{,}544$ tokens; it also sets state-of-the-art results on multiple datasets. The approach offers a scalable, training-free way to improve cross-modal reasoning by concentrating computation on query-relevant frames while preserving salient semantic content.

Abstract

Recent advances in long video understanding typically mitigate visual redundancy through visual token pruning based on attention distribution. However, while existing methods employ post-hoc low-response token pruning in decoder layers, they overlook the input-level semantic correlation between visual tokens and instructions (the query). In this paper, we propose QuoTA, an ante-hoc, training-free module that extends existing large video-language models (LVLMs) with visual token assignment based on query-oriented frame-level importance assessment. Query-oriented token selection is crucial because it aligns visual processing with task-specific requirements, optimizing token budget utilization while preserving semantically relevant content. Specifically, (i) QuoTA assigns frame-level importance scores based on query relevance, enabling one-time visual token assignment before cross-modal interactions in decoder layers, (ii) we decouple the query through Chain-of-Thought (CoT) reasoning to facilitate more precise LVLM-based frame importance scoring, and (iii) QuoTA offers plug-and-play functionality that extends to existing LVLMs. Extensive experimental results demonstrate that implementing QuoTA with LLaVA-Video-7B yields an average performance improvement of 3.2% across six benchmarks (including Video-MME and MLVU) while operating within the same visual token budget as the baseline. Code is open-sourced at https://github.com/MAC-AutoML/QuoTA.
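
To make the idea of one-time, score-driven token assignment under a fixed budget concrete, here is a minimal sketch. It is not taken from the QuoTA code base: the function name `allocate_tokens`, the `min_tokens` floor, the proportional split, and the largest-remainder rounding are all assumptions made for illustration. The only detail drawn from the text is that frames receive tokens according to query-relevance scores while the total budget stays fixed.

```python
def allocate_tokens(scores, budget, min_tokens=1):
    """Split `budget` visual tokens across frames in proportion to their scores.

    Hypothetical helper: proportional allocation with a per-frame floor is an
    assumption, not the allocation rule confirmed by the paper.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    n = len(scores)
    if budget < n * min_tokens:
        raise ValueError("budget too small to give every frame min_tokens")

    # Treat negative scores as zero; fall back to a uniform split if all are zero.
    weights = [max(float(s), 0.0) for s in scores]
    total = sum(weights)
    if total == 0.0:
        weights = [1.0] * n
        total = float(n)

    # Every frame keeps `min_tokens`; the rest is shared by relevance.
    spare = budget - n * min_tokens
    shares = [spare * w / total for w in weights]
    alloc = [min_tokens + int(s) for s in shares]

    # Largest-remainder rounding so the allocation sums exactly to `budget`.
    leftover = budget - sum(alloc)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


if __name__ == "__main__":
    # Three frames with illustrative relevance scores and a 12,544-token budget.
    print(allocate_tokens([0.1, 0.7, 0.2], 12544))
```

Under this scheme the most query-relevant frame keeps the most tokens, and low-scoring frames are compressed rather than dropped outright, which is one simple way to stay within the same budget as a uniform baseline.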

Paper Structure

This paper contains 20 sections, 6 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Comparative analysis on Video-MME when implementing the attention-based token assignment methods AIM and FrameFusion, alongside our proposed query-oriented QuoTA, within LLaVA-Video-7B and LLaVA-OV-7B across varied relative visual token budgets. QuoTA demonstrates superior efficacy and consistent performance gains over the baseline across diverse token budget configurations.
  • Figure 2: The framework of QuoTA. Initially, a dynamic frame sampler extracts $T$ frames from the video based on its duration, which are subsequently processed by a ViT to generate visual embeddings $\bm{\mathrm{E}}$. Then, the base LVLM decouples the input query into a decoupled clue using Chain-of-Thought (CoT) reasoning, and the scoring LVLM uses this clue to generate frame-wise importance scores in parallel, evaluating the relevance of each frame to the query. Finally, a token assigner rescales the frame embeddings to $\bm{\mathrm{\hat{E}}}$ based on these importance scores.
  • Figure 3: Qualitative result on the Video-MME benchmark when applying QuoTA with LLaVA-Video-7B. The video frames with a blue border are query-oriented keyframes, and the bar chart shows the normalized QuoTA scores for each frame.
  • Figure 4: CoT-driven decouple prompt for object list.
  • Figure 5: CoT-driven decouple prompt for video event.
  • ...and 1 more figures
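
The last stage in the Figure 2 caption, where a token assigner rescales each frame's embeddings according to its importance score, can be illustrated with a small sketch. The caption does not say how the rescaling is done, so the contiguous average pooling below is an assumption, and the name `pool_tokens` is hypothetical; an actual implementation might instead pool or interpolate over the 2D patch grid.

```python
def pool_tokens(tokens, k):
    """Average-pool one frame's token list down to k tokens (1 <= k <= len(tokens)).

    `tokens` is a list of equal-length embedding vectors (lists of floats).
    Average pooling is an illustrative assumption, not the confirmed method.
    """
    n = len(tokens)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(tokens)")
    pooled = []
    for j in range(k):
        # Contiguous bins that together cover all n tokens exactly once.
        start, end = j * n // k, (j + 1) * n // k
        chunk = tokens[start:end]
        dim = len(chunk[0])
        pooled.append([sum(t[d] for t in chunk) / len(chunk) for d in range(dim)])
    return pooled


if __name__ == "__main__":
    frame = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
    # A low-importance frame might be squeezed from 4 tokens to 2.
    print(pool_tokens(frame, 2))
```

A frame judged highly relevant would be pooled lightly or not at all, while a frame with a low score would be reduced to only a few tokens, so the per-frame token counts follow the scores without exceeding the overall budget.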