A.I.R.: Enabling Adaptive, Iterative, and Reasoning-based Frame Selection For Video Question Answering

Yuanhao Zou, Shengji Jin, Andong Deng, Youpeng Zhao, Jun Wang, Chen Chen

TL;DR

The paper addresses the challenge of efficient yet accurate frame selection for VideoQA by proposing A.I.R., a training-free framework that adaptively samples events and iteratively refines a small set of high-potential frames using a reasoning-based VLM. It combines Adaptive Initial Sampling with a two-component GMM threshold to identify event-rich segments and an Interval Potential Ranking plus Localized Density Sampling loop to focus VLM analysis on promising regions, all under an adaptive budget $B$ that scales with video length. The approach yields state-of-the-art or competitive accuracy across diverse long-video benchmarks while reducing the number of frames analyzed and overall computation, demonstrating strong generalization across multiple foundation VLM backbones. It provides extensive ablations, showing the importance of reasoning-based analysis, interval-based ranking, and adaptive budgets, and offers reproducibility details and partial code for researchers to build upon.
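
As a rough illustration of the two-component GMM threshold mentioned above, the Python sketch below fits a 1-D mixture to per-frame query similarity scores with plain EM and groups consecutive above-threshold frames into events. The midpoint-of-means threshold, the function names `fit_gmm_1d` and `find_events`, and the EM initialisation are assumptions made for illustration; this summary does not state the paper's exact thresholding rule.

```python
import math
from typing import List, Tuple


def fit_gmm_1d(xs: List[float], iters: int = 50) -> Tuple[List[float], List[float], List[float]]:
    """Fit a two-component 1-D Gaussian mixture with plain EM (illustrative only)."""
    lo, hi = min(xs), max(xs)
    mu = [lo, hi]  # assumption: start the two means at the extremes
    var = [max((hi - lo) ** 2 / 4, 1e-6)] * 2
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each score
        resp = []
        for x in xs:
            p = [
                w[k] * math.exp(-((x - mu[k]) ** 2) / (2 * var[k]))
                / math.sqrt(2 * math.pi * var[k])
                for k in range(2)
            ]
            s = (p[0] + p[1]) or 1e-300  # guard against underflow to zero
            resp.append((p[0] / s, p[1] / s))
        # M-step: re-estimate mean, variance, and weight of each component
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-12
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk, 1e-6)
            w[k] = nk / len(xs)
    return mu, var, w


def find_events(scores: List[float]) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs of consecutive frames above the threshold."""
    mu, _, _ = fit_gmm_1d(scores)
    threshold = (mu[0] + mu[1]) / 2  # assumption: midpoint between the two means
    events, start = [], None
    for i, s in enumerate(scores):
        if s > threshold:
            if start is None:
                start = i
        elif start is not None:
            events.append((start, i - 1))
            start = None
    if start is not None:
        events.append((start, len(scores) - 1))
    return events
```

Because the threshold is derived from the score distribution of each video rather than fixed in advance, it adapts to videos whose similarity scores sit in different ranges. A library implementation of Gaussian mixtures could replace the hand-written EM loop.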

Abstract

Effectively applying Vision-Language Models (VLMs) to Video Question Answering (VideoQA) hinges on selecting a concise yet comprehensive set of frames, as processing entire videos is computationally infeasible. However, current frame selection methods face a critical trade-off: approaches relying on lightweight similarity models, such as CLIP, often fail to capture the nuances of complex queries, resulting in inaccurate similarity scores that cannot reflect the authentic query-frame relevance, which further undermines frame selection. Meanwhile, methods that leverage a VLM for deeper analysis achieve higher accuracy but incur prohibitive computational costs. To address these limitations, we propose A.I.R., a training-free approach for Adaptive, Iterative, and Reasoning-based frame selection. We leverage a powerful VLM to perform deep, semantic analysis on complex queries, and this analysis is deployed within a cost-effective iterative loop that processes only a small batch of the most high-potential frames at a time. Extensive experiments on various VideoQA benchmarks demonstrate that our approach outperforms existing frame selection methods, significantly boosts the performance of the foundation VLM, and achieves substantial gains in computational efficiency over other VLM-based techniques.

Paper Structure

This paper contains 32 sections, 11 equations, 11 figures, 11 tables, 1 algorithm.

Figures (11)

  • Figure 1: An illustration of the general pipeline of query-related frame selection and its key challenges. (a) The pipeline features two branches for query-related frame selection: using a lightweight model (e.g., CLIP) to produce a similarity score between the query and each frame, or using a large Analysis VLM to generate a relevance score. (b) Lightweight models suffer from ambiguous similarity, failing on complex queries. (c) Large VLMs incur a computational cost that explodes as the number of frames grows.
  • Figure 2: General pipeline of A.I.R. with three stages: (1) Adaptive Initial Sampling, which identifies potential 'events' based on query similarity and dynamically samples frames around them using an adaptive budget; (2) Iterative Frame Selection, which progressively refines the frame selection via four steps; and (3) the QA Stage, which feeds the final selected frames into the Answering VLM.
  • Figure 3: Two main stages in our A.I.R. (a) Adaptive Initial Sampling: A GMM-based adaptive threshold is applied to the query-frame similarity $S$ to identify potential events, and then event-wise sampling is conducted on the refined events to obtain $K$ frames ($\mathcal{F}_\mathrm{initial}$). (b) Iterative Frame Selection: In each iteration, 1) high-potential candidates are selected via Interval Potential Ranking; 2) a VLM performs reasoning-based analysis to validate the best frames; 3) an Early Stop mechanism checks whether the frame budget is met; and 4) if it is not met, Localized Density Sampling (LDS) discovers more frames around the validated frames and feeds them into the next iteration. Notably, LDS is performed on the original video ($N$ frames) instead of the uniformly sampled $n$ frames.
  • Figure 4: Accuracy comparison on 6 question types of Video-MME (w/o sub., 32 frames) using InternVL3-8B.
  • Figure 4: Ablations of A.I.R.'s components on Video-MME using InternVL3-8B. We compare on average frames for answering VLMs and accuracy (Acc.).
  • ...and 6 more figures
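
The four-step loop described in the Figure 3 caption can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: candidates are ranked by raw similarity as a stand-in for Interval Potential Ranking, the reasoning-based VLM is replaced by a hypothetical callable `vlm_is_relevant`, and Localized Density Sampling is approximated by adding all frames within a fixed `radius` of each validated frame. The names `iterative_select`, `batch_size`, `radius`, and `max_iters` are likewise hypothetical.

```python
from typing import Callable, List, Set


def iterative_select(
    initial: List[int],
    similarity: List[float],
    vlm_is_relevant: Callable[[int], bool],
    budget: int,
    batch_size: int = 4,
    radius: int = 2,
    max_iters: int = 5,
) -> List[int]:
    """Select up to `budget` frame indices, calling the VLM on small batches only."""
    n_frames = len(similarity)  # similarity covers every frame of the original video
    candidates: Set[int] = set(initial)
    seen: Set[int] = set()
    selected: List[int] = []
    for _ in range(max_iters):
        # 1) Rank unseen candidates (stand-in for Interval Potential Ranking).
        pool = sorted(candidates - seen, key=lambda i: (-similarity[i], i))
        batch = pool[:batch_size]
        if not batch:
            break
        seen.update(batch)
        # 2) The (hypothetical) VLM validates each frame in the small batch.
        validated = [i for i in batch if vlm_is_relevant(i)]
        selected.extend(validated)
        # 3) Early stop once the frame budget is met.
        if len(selected) >= budget:
            break
        # 4) Add neighbours of validated frames from the full video
        #    (a crude approximation of Localized Density Sampling).
        for i in validated:
            for j in range(max(0, i - radius), min(n_frames, i + radius + 1)):
                candidates.add(j)
    return sorted(selected[:budget])
```

The point of the design is that the expensive VLM only ever sees `batch_size` frames per iteration, and new candidates are drawn near frames it has already confirmed rather than from the whole video.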