VideoDistill: Language-aware Vision Distillation for Video Question Answering

Bo Zou, Chao Yang, Yu Qiao, Chengbin Quan, Youjian Zhao

TL;DR

VideoDistill tackles long-form VideoQA by addressing language bias and long-range reasoning with a language-aware, vision-driven distillation framework. It introduces LA-Gate to enable goal-driven interaction without fusing language into visual representations, supplemented by a differentiable sparse sampling module and a vision refinement stack that extract multi-scale, question-relevant semantics. The model is pretrained with Video-Text Matching, Vision-Guided Masked Language Modeling, and a lightweight contrastive objective. It achieves state-of-the-art results on both generic and long-form VideoQA benchmarks while reducing language shortcut usage on EgoTaskQA. The approach is also more robust to varying input frame counts and supports efficient, scalable reasoning over extended video content.

Abstract

Significant advances in video question answering (VideoQA) have been made thanks to thriving large image-language pretraining frameworks. Although these image-language models can efficiently represent both the video and language branches, they typically employ a goal-free vision perception process and integrate vision with language poorly during answer generation, thus omitting crucial visual cues. Inspired by human recognition and learning patterns, we propose VideoDistill, a framework with language-aware (i.e., goal-driven) behavior in both the vision perception and answer generation processes. VideoDistill generates answers only from question-related visual embeddings and follows a thinking-observing-answering approach that closely resembles human behavior, distinguishing it from previous research. Specifically, we develop a language-aware gating mechanism to replace standard cross-attention, avoiding the direct fusion of language into visual representations. We incorporate this mechanism into two key components of the framework. The first is a differentiable sparse sampling module, which selects frames containing the dynamics and semantics relevant to the question. The second is a vision refinement module that merges existing spatial-temporal attention layers to ensure the extraction of multi-grained visual semantics associated with the question. We evaluate on various challenging video question-answering benchmarks, and VideoDistill achieves state-of-the-art performance on both general and long-form VideoQA datasets. In addition, we verify that VideoDistill effectively reduces reliance on language shortcut solutions on the EgoTaskQA dataset.
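
To make the contrast with cross-attention concrete, here is a minimal sketch of a question-conditioned gate. It is an illustration only, not the paper's LA-Gate formulation: the function name `language_gate`, the pooled `question_vec`, and the gating rule (a sigmoid of a scaled dot product) are all assumptions. The point it shows is that the question only rescales each visual token, so no language feature values are mixed into the visual representation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def language_gate(visual_tokens, question_vec):
    """Scale each visual token by a question-dependent scalar in (0, 1).

    Hypothetical gating rule (an assumption, not taken from the paper):
        gate = sigmoid(<v, q> / sqrt(d))
    The output is a rescaled copy of the visual token, so the language
    vector influences the result only through the scalar gate.
    """
    d = len(question_vec)
    gated, gates = [], []
    for v in visual_tokens:
        g = sigmoid(dot(v, question_vec) / math.sqrt(d))
        gates.append(g)
        gated.append([g * x for x in v])
    return gated, gates

# Toy example: three 2-d visual tokens and one pooled question vector.
visual_tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
question_vec = [2.0, 0.0]
gated, gates = language_gate(visual_tokens, question_vec)
```

In this toy input the first token aligns with the question and receives the largest gate, while the third token points the opposite way and is suppressed. A cross-attention layer would instead output a weighted sum of language values, which is the direct fusion the abstract says the gate avoids.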


Paper Structure

This paper contains 26 sections, 10 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: Challenges of goal-free VideoQA models. They cannot efficiently handle (a) long-term dependencies, (b) multi-events, and (c) multi-scale semantics in the videos. They also suffer from (d) the language prior phenomenon in training question-answer pairs.
  • Figure 2: Overview of VideoDistill. VideoDistill first densely samples video frames and utilizes a pre-trained image-language encoder to extract features, then sparsely samples a small number of question-related frames by a differentiable sparse sampling module. Finally, VideoDistill uses a vision refinement module to emphasize necessary multi-scale visual semantics in selected frames.
  • Figure 3: Illustrations of Self-Attention, Cross Attention, and our LA-Gate mechanisms.
  • Figure 4: Architectures of Language-Aware Gate, Frame Sampling Block, and Vision Refinement Block.
  • Figure 5: The impact of the number of frames.
  • ...and 3 more figures
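
The Figure 2 caption describes picking a few question-related frames from densely sampled ones in a differentiable way. The sketch below shows one common relaxation of top-k selection (successive softmax with score suppression), which may differ from the module the paper actually uses. The relevance `scores`, the `temperature`, and the helper names `soft_top_k` and `mix_frames` are hypothetical.

```python
import math

def softmax(scores, temperature):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_top_k(scores, k, temperature=0.1, eps=1e-9):
    """Return k soft one-hot weight vectors over the frames.

    Each round takes a softmax over the current scores, then lowers the
    score of frames that were already selected by adding log(1 - weight),
    so the next round favours a different frame. Every step is a smooth
    function of the input scores.
    """
    current = list(scores)
    rounds = []
    for _ in range(k):
        w = softmax(current, temperature)
        rounds.append(w)
        current = [s + math.log(max(1.0 - wi, eps)) for s, wi in zip(current, w)]
    return rounds

def mix_frames(frames, weights):
    """Weighted average of frame feature vectors (a soft 'selected frame')."""
    dim = len(frames[0])
    return [sum(w * f[i] for w, f in zip(weights, frames)) for i in range(dim)]

# Toy example: four densely sampled frames with question-relevance scores.
frames = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.1, 2.0, 0.3, 1.5]
weights = soft_top_k(scores, k=2)
selected = [mix_frames(frames, w) for w in weights]
```

With a low temperature each weight vector is close to one-hot, so `selected` is close to the two highest-scoring frames. Raising the temperature spreads the weights across frames, which trades selection sharpness for smoother gradients in an autodiff framework.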