VidCtx: Context-aware Video Question Answering with Image Models

Andreas Goulas, Vasileios Mezaris, Ioannis Patras

TL;DR

VidCtx addresses resource limitations in VideoQA with a training-free framework that fuses frame-level visual cues with question-aware captions extracted by a pre-trained LMM. It processes frames individually, enriching each prompt with a caption from a distant frame to capture temporal relations, and aggregates the frame-level decisions via max pooling to produce a video-level answer, so that inference cost is linear, $O(n)$, in the number of frames. The approach achieves competitive zero-shot results on NExT-QA, IntentQA, and STAR, outperforming several open baselines and approaching proprietary LLM-based methods. This simple context integration scales to long videos without fine-tuning and provides a practical pathway for open-model VideoQA systems.

Abstract

To address the computational and memory limitations of Large Multimodal Models in the Video Question-Answering task, several recent methods extract textual representations per frame (e.g., by captioning) and feed them to a Large Language Model (LLM) that processes them to produce the final response. However, in this way, the LLM does not have access to visual information and often has to process repetitive textual descriptions of nearby frames. To address those shortcomings, in this paper we introduce VidCtx, a novel training-free VideoQA framework which integrates both modalities, i.e., both visual information from input frames and textual descriptions of other frames that give the appropriate context. More specifically, in the proposed framework a pre-trained Large Multimodal Model (LMM) is prompted to extract, at regular intervals, question-aware textual descriptions (captions) of video frames. These are then used as context when the same LMM is prompted to answer the question at hand, given as input a) a certain frame, b) the question and c) the context/caption of an appropriate frame. To avoid redundant information, we choose as context the descriptions of distant frames. Finally, a simple yet effective max pooling mechanism is used to aggregate the frame-level decisions. This methodology enables the model to focus on the relevant segments of the video and scale to a high number of frames. Experiments show that VidCtx achieves competitive performance among approaches that rely on open models on three public Video QA benchmarks, NExT-QA, IntentQA and STAR. Our code is available at https://github.com/IDT-ITI/VidCtx.
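
The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation (see the linked repository for that). The names `caption_fn`, `score_fn`, `distant_index` and `answer_video` are hypothetical. The sketch assumes a caption is extracted for every sampled frame, that the LMM returns one score per answer option, and that "distant" means roughly half a video away; the abstract only states that distant frames are chosen, not how.

```python
from typing import Callable, List, Sequence, Tuple


def distant_index(i: int, n: int) -> int:
    """Hypothetical pairing rule: pair frame i with the frame half a video away."""
    return (i + n // 2) % n


def answer_video(
    frames: Sequence[object],
    question: str,
    options: Sequence[str],
    caption_fn: Callable[[object, str], str],
    score_fn: Callable[[object, str, Sequence[str], str], List[float]],
) -> Tuple[int, List[float]]:
    """Return (index of the chosen option, max-pooled score per option).

    caption_fn and score_fn stand in for two prompts to the same LMM
    (an assumption of this sketch): one produces a question-aware caption,
    the other scores each answer option given a frame plus a context caption.
    """
    n = len(frames)
    if n == 0:
        raise ValueError("at least one frame is required")

    # Step 1: one question-aware caption per frame (n LMM calls).
    captions = [caption_fn(frame, question) for frame in frames]

    # Step 2: answer per frame, with a distant frame's caption as context
    # (another n LMM calls, so the total cost stays linear in n).
    frame_scores = []
    for i, frame in enumerate(frames):
        context = captions[distant_index(i, n)]
        frame_scores.append(score_fn(frame, question, options, context))

    # Step 3: max pooling over frames, independently for each option.
    pooled = [max(scores[k] for scores in frame_scores) for k in range(len(options))]
    best = max(range(len(options)), key=pooled.__getitem__)
    return best, pooled
```

Because each frame triggers a fixed number of model calls and the pooling step is a single pass over the per-frame scores, the total work grows linearly with the number of frames, which matches the $O(n)$ complexity stated in the TL;DR.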

Paper Structure

This paper contains 16 sections, 3 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Comparison of video understanding frameworks. Related works typically either project the input frames to the LLM input space or rely on extracted captions to answer questions about videos. We propose combining both modalities, processing the video frame-by-frame and inserting the relevant context (i.e., extracted question-aware captions) as part of the LLM prompt. In our architecture, we use the same LMM for both captioning and question answering.
  • Figure 2: Qualitative study of VidCtx on NExT-QA. We select one question from each question category and present the normalized answer scores for pairs of distant frames, along with their question-aware captions.