
Less is More: Token-Efficient Video-QA via Adaptive Frame-Pruning and Semantic Graph Integration

Shaoguang Wang, Weiyu Guo, Ziyang Chen, Yijie Xu, Xuming Hu, Hui Xiong

TL;DR

This work addresses the prohibitive token cost of Video-QA with Multimodal LLMs by identifying "visual echoes": redundant frames that dilute context. It introduces Adaptive Frame-Pruning (AFP), which adaptively clusters and collapses redundant frames, and couples it with a lightweight textual semantic graph that preserves semantic context with minimal text. Empirical results on LongVideoBench and VideoMME show up to an 80% reduction in input tokens with improved or competitive accuracy, especially for open-source models, indicating a strong efficiency-accuracy trade-off. The proposed framework is model- and selector-agnostic, offering a practical path toward scalable, token-efficient long-form video understanding and suggesting future work toward a Multimodal Semantic Graph that fuses audio and motion cues.

Abstract

The practical application of Multimodal Large Language Models (MLLMs) to Video Question Answering (Video-QA) is severely hindered by the high token cost of processing numerous video frames. While keyframe selection is the dominant strategy to mitigate this, we identify that even state-of-the-art selectors produce prompts laden with significant temporal redundancy, a challenge unique to video that we term 'visual echoes'. This issue leads to context dilution and can paradoxically degrade performance. To address this dual challenge, we propose a novel refinement framework that combines Adaptive Frame-Pruning (AFP) with a lightweight text-based semantic graph. AFP prunes 'visual echoes' by adaptively clustering frames, while the semantic graph provides low-cost semantic compensation. In extensive experiments on the LongVideoBench and VideoMME benchmarks against multiple state-of-the-art selectors, our approach reduces total input tokens by up to 80%. Crucially, by creating a concise, high-quality prompt, our framework not only enhances efficiency but also makes upstream selectors more robust and improves their accuracy, achieving results that are highly competitive with, and often superior to, baselines that use vastly more frames.
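To make the idea of pruning 'visual echoes' concrete, the following is a minimal sketch, not the AFP algorithm itself. It assumes each selected frame is already represented by a feature vector, and it swaps in a simple greedy temporal grouping in place of the paper's adaptive clustering. The threshold rule (mean similarity of adjacent frames, clamped to a range) and the names `adaptive_threshold` and `prune_visual_echoes` are hypothetical choices made for illustration only.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def adaptive_threshold(features: Sequence[Vector],
                       low: float = 0.8, high: float = 0.95) -> float:
    """Hypothetical rule: mean similarity of adjacent frames, clamped to [low, high]."""
    sims = [cosine_similarity(a, b) for a, b in zip(features, features[1:])]
    if not sims:
        return high
    return min(max(sum(sims) / len(sims), low), high)


def prune_visual_echoes(features: Sequence[Vector]) -> List[int]:
    """Return indices of representative frames, in temporal order.

    A frame is treated as an echo of the most recently kept frame when
    their similarity reaches the threshold; otherwise it is kept as the
    representative of a new group.
    """
    if not features:
        return []
    threshold = adaptive_threshold(features)
    kept = [0]
    for i in range(1, len(features)):
        if cosine_similarity(features[kept[-1]], features[i]) < threshold:
            kept.append(i)
    return kept


# Toy 2-D "features": three near-identical frames, two near-identical
# frames of a different scene, then a return to the first scene.
toy_features = [[1.0, 0.0], [0.99, 0.01], [1.0, 0.02],
                [0.0, 1.0], [0.01, 1.0], [1.0, 0.0]]
representatives = prune_visual_echoes(toy_features)
```

On the toy input, `representatives` is `[0, 3, 5]`: one frame per run of near-duplicates. Because the grouping is temporal, the final frame is kept even though it resembles frame 0; a global clustering step would merge the two.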

Paper Structure

This paper contains 40 sections, 2 equations, 19 figures, 11 tables, 1 algorithm.

Figures (19)

  • Figure 1: Conceptual Overview of our Refinement Framework. (a) An initial prompt from an upstream selector contains numerous 'visual echoes' and has a high token cost. (b) Our framework refines this by pruning redundant frames with AFP and compensating with a semantic graph, resulting in an optimized, low-cost prompt that leads to the correct answer. Token counts are illustrative estimates based on OpenAI's guidelines (see the experimental setup section).
  • Figure 2: "Visual Echoes" are a Prevalent Issue Across Mainstream Keyframe Selectors. We visualize 32 keyframes selected by three SOTA methods for a video QA task involving narrative understanding. All selectors exhibit severe redundancy, producing multiple near-identical frames for iconic subjects. We use colored bounding boxes to highlight these clusters of "visual echoes," such as the Fuji mountain view, the seated man, and the red Torii gate.
  • Figure 3: The Overall Pipeline of Our Proposed Method. An upstream selector provides initial frames. Our Adaptive Frame-Pruning (AFP) module then takes over, performing (1) fused feature extraction and adaptive clustering to produce representative keyframes, and (2) concurrent semantic graph generation. Both are combined into an optimized prompt for the MLLM.
  • Figure 4: The template of our textualized semantic graph. This concise, generated text block is inserted directly into the prompt at the downstream Video-QA stage to give the MLLM high-level semantic context.
  • Figure 5: Comparison of average estimated token consumption across datasets and methods. Token counts are estimated based on OpenAI's guidelines (see the experimental setup section for details). Our method (AFP + Graph) consistently and substantially reduces token requirements compared to the AKS* baseline across all settings on both the LongVideoBench and VideoMME datasets, highlighting its generalizable efficiency.
  • ...and 14 more figures
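
The captions for Figures 1 and 5 state that token counts are estimated from OpenAI's guidelines. As a rough sketch of how such an estimate might be computed, the code below assumes the tile-based rule OpenAI has documented for GPT-4-class vision inputs: a flat 85 tokens per low-detail image, or 85 tokens plus 170 per 512-pixel tile for a high-detail image after rescaling. Whether the paper uses exactly this rule, and which detail setting and frame resolution it uses, are assumptions here. The function names and the frame and text counts in the example are hypothetical and are not results from the paper.

```python
import math


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate tokens for one image under the assumed tile-based rule."""
    if detail == "low":
        return 85
    if detail != "high":
        raise ValueError("detail must be 'low' or 'high'")
    # Step 1: downscale (never upscale) to fit within a 2048 x 2048 square.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: downscale (never upscale) so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Step 3: count 512-pixel tiles; each costs 170 tokens, plus an 85-token base.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


def estimate_prompt_tokens(num_frames: int, width: int, height: int,
                           detail: str = "low", text_tokens: int = 0) -> int:
    """Total estimate: per-frame image tokens plus any accompanying text tokens."""
    return num_frames * estimate_image_tokens(width, height, detail) + text_tokens


# Purely illustrative numbers (not taken from the paper): a 32-frame prompt
# versus a pruned 6-frame prompt that carries 120 tokens of graph text.
baseline_tokens = estimate_prompt_tokens(32, 1280, 720, detail="low")
pruned_tokens = estimate_prompt_tokens(6, 1280, 720, detail="low", text_tokens=120)
reduction = 1 - pruned_tokens / baseline_tokens
```

Under these assumed inputs the baseline costs 2720 tokens and the pruned prompt 630. The sketch shows why removing frames dominates the saving: a short text block costs about as much as one or two low-detail frames.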