HORNet: Task-Guided Frame Selection for Video Question Answering with Vision-Language Models

Xiangyu Bai; Bishoy Galoaa; Sarah Ostadabbas

Abstract

Video question answering (VQA) with vision-language models (VLMs) depends critically on which frames are selected from the input video, yet most systems rely on uniform or heuristic sampling that cannot be optimized for downstream answering quality. We introduce HORNet, a lightweight frame selection policy trained with Group Relative Policy Optimization (GRPO) to learn which frames a frozen VLM needs to answer questions correctly. With fewer than 1M trainable parameters, HORNet reduces input frames by up to 99% and VLM processing time by up to 93%, while improving answer quality on short-form benchmarks (+1.7% F1 on MSVD-QA) and achieving strong performance on temporal reasoning tasks (+7.3 points over uniform sampling on NExT-QA). We formalize this as Select Any Frames (SAF), a task that decouples visual input curation from VLM reasoning, and show that GRPO-trained selection generalizes better out-of-distribution than supervised and PPO alternatives. HORNet's policy further transfers across VLM answerers without retraining, yielding an additional 8.5% relative gain when paired with a stronger model. Evaluated across six benchmarks spanning 341,877 QA pairs and 114.2 hours of video, our results demonstrate that optimizing what a VLM sees is a practical complement to optimizing what it generates, and one that also improves efficiency. Code is available at https://github.com/ostadabbas/HORNet.

Paper Structure (28 sections, 9 equations, 3 figures, 6 tables)

Introduction
Small Data Statement
Related Work
Frame selection for video understanding.
Reinforcement learning for frame selection.
GRPO for vision-language models.
Method
Problem Formulation
Policy parameterization.
Video Representation
HORNet Architecture
Video encoder.
Policy network.
Frozen VLM answerer.
Training with GRPO
...and 13 more sections

Figures (3)

Figure 1: HORNet pipeline. Given a video $\mathbf{V} = \{v_1, v_2, \ldots, v_T\}$ with $T$ uniformly sampled frames, our TimeSFormer-based video encoder $E$ extracts per-frame features $\mathbf{F} \in \mathbb{R}^{T \times D}$. A lightweight trainable MLP policy $\pi_\theta$ scores each frame independently, producing keep probabilities $p_t \in [0,1]$ and a binary selection mask $\mathbf{b} \in \{0,1\}^T$. Only the frames selected by the mask ($\mathbf{V'} \subseteq \mathbf{V}$) are passed to the frozen Qwen3-VL answerer. For example, to answer "How does the boy in black react while the boy on the green disc goes down?", HORNet selects only the frames capturing the key interaction moment and discards irrelevant context, which lets the answerer correctly predict "Look at him". At training time, GRPO samples $K$ candidate subsets, evaluates each via a String F1 reward against the ground-truth answer, and updates $\pi_\theta$ through group-normalized policy gradients. The VLM answerer remains frozen throughout, while the encoder and MLP policy are trainable.
Figure 2: HORNet encoder $E$. Input frames are patchified with a $P \times P$ convolution, processed by spatial self-attention within each frame, and then by temporal self-attention across frames at each patch location. The resulting temporally contextualized patch tokens are pooled to yield per-frame video representations used by HORNet for frame selection. $B$ is the batch size, $T$ is the frame count, and $D$ is the hidden dimension. We set $P = 16$, $T = 32$, and $D = 768$ in our training.
Figure 3: Qualitative example of HORNet’s frame-selection behavior on a multiple-choice (MCQ) sample and an open-ended sample from the NExT-QA dataset. Given a fixed 8-frame input, uniform sampling in Qwen-VL captures frames of the child crawling instead of the slide following the discard of the cart (left), and a frame of a person working at a computer while missing the eating frames (right), leading the model to produce an incorrect answer. With a dense initial sampling ($T = 256$ frames), HORNet selects the full 8-frame sequence of action-relevant frames while discarding distractors, enabling the model to recover the correct prediction.
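
The training signal described in the Figure 1 caption (sample $K$ frame subsets from the keep probabilities, score each answer with String F1, then normalize rewards within the group) can be illustrated with a minimal sketch. This is not the authors' implementation: the token-overlap definition of String F1, the mean/standard-deviation normalization, and the names `string_f1`, `sample_masks`, `group_normalized_advantages`, and `toy_answerer` are all assumptions made for illustration, and the frozen VLM is replaced by a hypothetical stand-in.

```python
import random
import statistics
from collections import Counter


def string_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 on lowercased, whitespace-split strings (assumed definition)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def sample_masks(keep_probs, num_samples, rng):
    """Draw K binary masks b; frame t is kept independently with probability p_t."""
    return [
        [1 if rng.random() < p else 0 for p in keep_probs]
        for _ in range(num_samples)
    ]


def group_normalized_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one group of K samples (assumed GRPO-style form)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def toy_answerer(mask):
    """Hypothetical stand-in for the frozen VLM: correct only if frame 2 is kept."""
    return "look at him" if mask[2] else "walk away"


if __name__ == "__main__":
    rng = random.Random(0)
    keep_probs = [0.1, 0.2, 0.9, 0.3]  # illustrative p_t values, not from the paper
    masks = sample_masks(keep_probs, num_samples=4, rng=rng)
    rewards = [string_f1(toy_answerer(m), "Look at him") for m in masks]
    advantages = group_normalized_advantages(rewards)
    for mask, reward, advantage in zip(masks, rewards, advantages):
        print(mask, round(reward, 2), round(advantage, 2))
```

In a full training loop, each advantage would weight the log-probability of its sampled mask under $\pi_\theta$. Masks that lead to better answers than the group average are reinforced, and no learned value function is required.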
