Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering

Ting Yu, Kunhao Fu, Jian Zhang, Qingming Huang, Jun Yu

TL;DR

This work proposes a Cross-modal Collaborative Generation module that reformulates VideoQA as a generative task instead of the conventional classification scheme, giving the model the capacity for high-level cross-modal semantic fusion and generation so that it can reason about and answer questions.

Abstract

Long-term Video Question Answering (VideoQA) is a challenging vision-and-language task that requires semantic understanding of untrimmed long-term videos and diverse free-form questions, together with comprehensive cross-modal reasoning, to yield precise answers. Canonical approaches typically rely on off-the-shelf feature extractors to avoid expensive computation, but this often yields domain-independent, modality-unrelated representations. Furthermore, the inherent gradient blocking between unimodal comprehension and cross-modal interaction hinders reliable answer generation. In contrast, recent video-language pre-training models enable cost-effective end-to-end modeling but fall short in domain-specific reasoning and exhibit disparities in task formulation. To this end, we present an entirely end-to-end solution for long-term VideoQA: the Multi-granularity Contrastive cross-modal collaborative Generation (MCG) model. To derive discriminative representations rich in high-level visual concepts, we introduce Joint Unimodal Modeling (JUM) on a clip-bone architecture and leverage Multi-granularity Contrastive Learning (MCL) to harness the intrinsically or explicitly exhibited semantic correspondences. To alleviate the task formulation discrepancy, we propose a Cross-modal Collaborative Generation (CCG) module that reformulates VideoQA as a generative task instead of the conventional classification scheme, giving the model the capacity for high-level cross-modal semantic fusion and generation so that it can reason about and answer questions. Extensive experiments on six publicly available VideoQA datasets demonstrate the superiority of our proposed method.
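
The abstract names Multi-granularity Contrastive Learning without giving its loss here. As a rough illustration of the instance-level idea only, the sketch below implements the generic symmetric InfoNCE objective over paired video and question embeddings. The function names, the temperature value, and the toy embeddings are assumptions made for this example and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(video_embs, text_embs, temperature=0.07):
    """Generic symmetric InfoNCE over a batch of paired embeddings.

    video_embs[i] and text_embs[i] are assumed to describe the same sample;
    every other pairing in the batch is treated as a negative.
    The temperature of 0.07 is an illustrative default, not a value from the paper.
    """
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    n = len(v)

    # sim[i][j]: temperature-scaled cosine similarity of video i and text j.
    sim = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy(logits, target):
        # Numerically stable log-sum-exp minus the logit of the true pair.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        return log_z - logits[target]

    video_to_text = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    text_to_video = sum(
        cross_entropy([sim[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (video_to_text + text_to_video)


# Toy batch: two samples with 2-dimensional embeddings.
aligned_videos = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce(aligned_videos, aligned_texts)
loss_shuffled = info_nce(aligned_videos, shuffled_texts)
```

In this toy batch the correctly paired embeddings give a loss near zero, while the shuffled pairing is penalized heavily. That gap is the signal a contrastive objective uses to pull matching video and question representations together.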

Paper Structure

This paper contains 33 sections, 11 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Comparison of Existing VideoQA Paradigms with Our Approach. (a) Canonical models following a two-stage paradigm use offline feature extractors to reduce computational overhead, yet suffer from domain-independent, modality-unrelated representations and gradient blocking. (b) Video-language pre-training models facilitate affordable end-to-end modeling on the raw cross-modal inputs. However, the extra appended classifier head results in task disparity. (c) Our MCG embodies an entirely end-to-end generative paradigm.
  • Figure 2: The proposed MCG framework comprises three key components: (a) Joint Unimodal Modeling (JUM) derives discriminative representations for each modality under the supervision of its complementary modality. (b) The Multi-granularity Contrastive Learning (MCL) strategy leverages JUM to exploit multi-granular correspondence, enhancing the generation of high-quality unimodal semantics. MCL incorporates Instance-granularity Contrastive Learning (ICL) to capture global semantic consistency and Token-granularity Contrastive Learning (TCL) to concentrate on often overlooked yet crucial subtle cues. (c) The Cross-modal Collaborative Generation (CCG) module includes a cross-modal fusor enabling deep multimodal information interaction and an answer generator producing answers conditioned on referenced videos. During the question-answering phase, the joint unimodal encoder extracts discriminative unimodal semantics. These semantics then pass through the cross-modal fusor to yield fused cross-modal reasoning evidence. Finally, the answer generator uses this evidence to generate the answer.
  • Figure 3: Illustration of Intra-Video Model (IVM). (a) Sparse Head-Tail Sampling. (b) Frame Partition. (c) Sparsely Time-Space Divided Attention.
  • Figure 4: In-depth ablation study comparing multi-granularity contrastive learning (w/ ICL&TCL) with uni-granularity contrastive learning (w/o ICL and w/o TCL) across various task types, including temporal relationship (Tem. R.), spatial relationship (Spa. R.), motion, and free tasks. Best viewed in color.
  • Figure 5: Visualization of training and validation accuracy trends across training epochs for (a) MCG-CH, a model variant employing a Classifier Head, and (b) MCG, the proposed model with language generation capability. Best viewed in color.
  • ...and 3 more figures
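
The Figure 2 caption says TCL attends to fine-grained, token-level cues but does not specify how token correspondences are scored. One common way to score them is late interaction: each word token is matched to its most similar visual token, and the best matches are averaged. The sketch below shows that generic scoring rule only. Whether MCG's TCL uses this rule is an assumption, and all names and toy vectors here are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def token_level_similarity(visual_tokens, word_tokens):
    """Late-interaction score: mean over words of the best-matching visual token.

    This is a generic token-wise matching rule used for illustration;
    it is not claimed to be the paper's TCL formulation.
    """
    best_matches = [
        max(cosine(word, visual) for visual in visual_tokens)
        for word in word_tokens
    ]
    return sum(best_matches) / len(best_matches)


# Toy example: two visual tokens and single-word "questions".
visual_tokens = [[1.0, 0.0], [0.0, 1.0]]
matching_words = [[0.0, 2.0]]    # points the same way as the second visual token
unrelated_words = [[-1.0, 0.0]]  # opposite to the first, orthogonal to the second

score_match = token_level_similarity(visual_tokens, matching_words)
score_unrelated = token_level_similarity(visual_tokens, unrelated_words)
```

Because the score takes a maximum per word before averaging, a single well-matched visual token can dominate. This is why token-level objectives can pick up subtle cues that a single pooled embedding would average away.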