Self-Adaptive Sampling for Efficient Video Question-Answering on Image--Text Models

Wei Han, Hui Chen, Min-Yen Kan, Soujanya Poria

TL;DR

Video QA on image–text models is hampered by the high compute cost of video transformers. The authors propose two offline frame-sampling strategies, MIF (question-aware) and MDF (question-agnostic), to select frames that preserve answer-relevant content while enabling efficient fine-tuning of image–text pretrained models. Across CLIP, GIT, and All-in-one backbones on MSVD-QA, MSRVTT-QA, TGIF-QA, and NExT-QA, both methods yield consistent accuracy gains over standard sampling baselines, with MDF offering better efficiency and speedups around $2.5$–$4\times$. The results demonstrate that carefully designed offline frame sampling can substantially bridge the gap between image-based models and video QA, enabling faster development and potential real-time deployment.

Abstract

Video question-answering is a fundamental task in the field of video understanding. Although current vision--language models (VLMs) equipped with Video Transformers have enabled temporal modeling and yielded superior results, they come at the cost of huge computational power and are thus too expensive to deploy in real-time application scenarios. An economical workaround samples only a small portion of frames to represent the main content of the video and tunes an image--text model on these sampled frames. Recent video understanding models usually sample a set of frames or clips at random, regardless of the internal correlations between their visual contents or their relevance to the problem. We argue that such aimless sampling may omit the key frames from which the correct answer can be deduced, and that the situation gets worse as the sampling sparsity increases, which always happens as video length increases. To mitigate this issue, we propose two frame sampling strategies, namely the most dominant frames (MDF) and most implied frames (MIF), to maximally preserve those frames that are most likely vital to the given questions. MDF passively minimizes the risk of key frame omission in a bootstrap manner, while MIF actively searches for key frames customized for each video--question pair with the assistance of auxiliary models. The experimental results on three public datasets from three advanced VLMs (CLIP, GIT and All-in-one) demonstrate that our proposed strategies can boost the performance of image--text pretrained models. The source code for the method proposed in this paper is publicly available at https://github.com/declare-lab/sas-vqa.

Paper Structure (41 sections, 8 equations, 6 figures, 6 tables, 1 algorithm)

This paper contains 41 sections, 8 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: Comparison between conventional I/O (online sampling) and ours. The blue and green arrows distinguish the dataflow between online sampling methods and ours until the end of preprocessing. The red box highlights the process we alter from conventional routines.
  • Figure 2: Existing sampling strategies for video--question answering tasks. In heuristic sampling, the black boxes indicate selected frames.
  • Figure 3: Randomly sampled video frames from the MSRVTT-QA dataset and two questions. The bracketed timestamps indicate cues for the corresponding answers in the video. The QA pair in the red box cannot be grounded in the four sampled frames.
  • Figure 4: MIF workflow. The example shows how one frame is selected out of two.
  • Figure 5: Sample MDF processing (6 frames). The heatmap visualizes the frame similarity matrix, computed as the cosine similarity between pairs of frame vectors. The entry in row $i$, column $j$ is the similarity between frames $i$ and $j$. Blue points indicate the frames eventually extracted.
  • ...and 1 more figure
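
The frame similarity matrix described in the Figure 5 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `cosine_similarity_matrix`, the feature dimension of 512, and the random stand-in features are all assumptions, and the step that chooses which frames to extract from the matrix is deliberately left out because this section does not describe it.

```python
import numpy as np


def cosine_similarity_matrix(frame_vectors: np.ndarray) -> np.ndarray:
    """Return an (n, n) matrix whose (i, j) entry is the cosine
    similarity between frame vectors i and j.

    frame_vectors: array of shape (n_frames, dim), one row per frame.
    """
    # Normalize each row to unit length; the clip guards against
    # division by zero for an all-zero feature vector.
    norms = np.linalg.norm(frame_vectors, axis=1, keepdims=True)
    unit = frame_vectors / np.clip(norms, 1e-12, None)
    # The dot product of unit vectors is their cosine similarity.
    return unit @ unit.T


# Hypothetical stand-in for per-frame features of a 6-frame video
# (in practice these would come from an image encoder).
rng = np.random.default_rng(0)
frames = rng.normal(size=(6, 512))
sim = cosine_similarity_matrix(frames)
```

The resulting matrix is symmetric with ones on the diagonal, which matches the structure of the heatmap in the caption: each frame is perfectly similar to itself, and entry (i, j) equals entry (j, i).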