Less is More: Token-Efficient Video-QA via Adaptive Frame-Pruning and Semantic Graph Integration
Shaoguang Wang, Weiyu Guo, Ziyang Chen, Yijie Xu, Xuming Hu, Hui Xiong
TL;DR
This work addresses the prohibitive token cost of Video-QA with Multimodal LLMs by identifying "visual echoes": temporally redundant frames that dilute the context. It introduces Adaptive Frame-Pruning (AFP), which adaptively clusters redundant frames and collapses each cluster, and couples it with a lightweight text-based semantic graph that preserves semantic context at minimal token cost. Empirical results on LongVideoBench and VideoMME show up to an 80% reduction in input tokens with improved or competitive accuracy, especially for open-source models, indicating a strong efficiency-accuracy trade-off. The framework is model- and selector-agnostic, offering a practical path toward scalable, token-efficient long-form video understanding, and it suggests future work on a Multimodal Semantic Graph that fuses audio and motion cues.
Abstract
The practical application of Multimodal Large Language Models (MLLMs) to Video Question Answering (Video-QA) is severely hindered by the high token cost of processing numerous video frames. While keyframe selection is the dominant strategy to mitigate this, we identify that even state-of-the-art selectors produce prompts laden with significant temporal redundancy, a challenge unique to video that we term 'visual echoes'. This redundancy dilutes the context and can paradoxically degrade performance. To address the dual challenge of token cost and context dilution, we propose a refinement framework that combines Adaptive Frame-Pruning (AFP) with a lightweight text-based semantic graph. AFP prunes 'visual echoes' by adaptively clustering frames, while the semantic graph provides low-cost semantic compensation for the pruned content. In extensive experiments on the LongVideoBench and VideoMME benchmarks with multiple state-of-the-art selectors, our approach reduces total input tokens by up to 80%. Crucially, by creating a concise, high-quality prompt, our framework not only improves efficiency but also makes upstream selectors more robust and more accurate, achieving results that are highly competitive with, and often superior to, baselines that use vastly more frames.
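
To make the idea of pruning 'visual echoes' concrete, the following is a minimal sketch of adaptive pruning over a sequence of per-frame feature vectors. It is not the authors' implementation: the abstract does not specify the features, the clustering algorithm, or the threshold rule. The function names, the parameter `k`, the mean-plus-`k`-standard-deviations threshold, and the choice to keep the first frame of each cluster are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def cosine_distance(a, b):
    """Return 1 - cosine similarity; treat zero vectors as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def prune_visual_echoes(features, k=0.5):
    """Hypothetical helper: group consecutive near-duplicate frames and
    return the index of the first frame in each group.

    The cut-off adapts to the input: it is derived from the distribution of
    distances between neighbouring frames rather than being a fixed constant
    (an assumed rule, not taken from the paper).
    """
    if len(features) < 2:
        return list(range(len(features)))

    # Distance between each frame and its successor.
    dists = [
        cosine_distance(features[i], features[i + 1])
        for i in range(len(features) - 1)
    ]
    threshold = mean(dists) + k * pstdev(dists)

    kept = [0]  # the first frame always opens a cluster
    for i, d in enumerate(dists, start=1):
        if d > threshold:  # a large jump starts a new cluster at frame i
            kept.append(i)
    return kept


# Toy input: three near-identical frames followed by two near-identical frames.
frames = [[1.0, 0.0], [0.99, 0.1], [1.0, 0.05], [0.0, 1.0], [0.05, 1.0]]
print(prune_visual_echoes(frames))  # [0, 3]
```

On the toy input, five frames collapse to two representatives, which is the kind of reduction that lowers the visual token count before the prompt is assembled. In a real pipeline the feature vectors would come from an image encoder, and the discarded frames' content would need to be compensated for elsewhere, which is the role the abstract assigns to the semantic graph.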
