Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion

Ishaan Singh Rawal, Alexander Matyasko, Shantanu Jaiswal, Basura Fernando, Cheston Tan

TL;DR

The paper tackles whether VideoQA transformers truly learn joint multimodal representations or rely on dataset biases. It introduces QUAG, a lightweight, non-parametric probe that impairs modality fusion by quadrant-wise averaging in attention and evaluates combined dataset-model representations without finetuning. It further develops QUAG-attention, a restricted attention variant, and CLAVI, a stress-test dataset designed to enforce high modality coupling. Across real datasets and synthetic simulations, QUAG reveals that high benchmark accuracy often does not reflect coupled multimodal understanding, while QUAG-attention can drastically reduce computation with minimal loss on several tasks. Together, QUAG and CLAVI uncover brittleness in current VideoQA models and advocate for diagnostic benchmarks that properly stress highly-coupled multimodal representations, with $\rho(\bm{\tilde{A}})$ bounds illustrating how short-circuiting constrains representation power.

Abstract

While VideoQA Transformer models demonstrate competitive performance on standard benchmarks, the reasons behind their success are not fully understood. Do these models capture the rich multimodal structures and dynamics from video and text jointly? Or are they achieving high scores by exploiting biases and spurious features? Hence, to provide insights, we design $\textit{QUAG}$ (QUadrant AveraGe), a lightweight and non-parametric probe, to conduct dataset-model combined representation analysis by impairing modality fusion. We find that the models achieve high performance on many datasets without leveraging multimodal representations. To validate QUAG further, we design $\textit{QUAG-attention}$, a less-expressive replacement of self-attention with restricted token interactions. Models with QUAG-attention achieve similar performance with significantly fewer multiplication operations without any finetuning. Our findings raise doubts about the current models' abilities to learn highly-coupled multimodal representations. Hence, we design the $\textit{CLAVI}$ (Complements in LAnguage and VIdeo) dataset, a stress-test dataset curated by augmenting real-world videos to have high modality coupling. Consistent with the findings of QUAG, we find that most of the models achieve near-trivial performance on CLAVI. This reasserts the limitations of current models in learning highly-coupled multimodal representations, a capability that the current datasets do not evaluate (project page: https://dissect-videoqa.github.io).

Paper Structure (41 sections, 2 theorems, 8 equations, 8 figures, 16 tables)

Key Result

Theorem 2.1

Unimodal short-circuiting produces unimodal average conformable features in all the fusion blocks.

Figures (8)

  • Figure 1: Illustrative toy example of the row-wise average-and-replace operation, $\mathcal{R}(\bm{A}, \mathcal{VV})$, where $\bm{A}$ is the input attention matrix (left). The cells are colored by quadrant ($\mathcal{VV}$: red, $\mathcal{VT}$: yellow, $\mathcal{TV}$: blue, $\mathcal{TT}$: green). We apply the $\mathcal{R}$ operator to the $\mathcal{VV}$ quadrant (highlighted in yellow) to replace its values with the respective row-wise average (right).
  • Figure 2: Result of the simulation study. The plot of percentage increase in the mean squared loss after crossmodal short-circuiting (y) versus $\alpha$, the crossmodal coupling coefficient (x).
  • Figure 3: Illustrative example of the creation of CLAVI. In the original video (V), the action "holding clothes" (Event A; blue pane) follows "taking food" (Event B; brown pane). To create a complement video (V'), we swap the action segments without manipulating the segment separating them. The questions (Q), along with their complement (Q'), are curated for each of the videos. Note that the color of the question panel reflects the correct answer (green for "yes", pink for "no"). We provide the list of questions in Table \ref{table:clavi_eg}.
  • Figure 4: Toy example of $\phi(Z, [\mathcal{TT, VV}])$, where $Z$ is the input (left-most matrix), $\mathcal{R}$ is the row-wise average-and-replace operator, and hatching denotes padding. The quadrants that are operated on are highlighted with a bright yellow box. Note that $L_\mathcal{V} = 3$ and $L_\mathcal{T} = 2$, and that video embeddings are pre-concatenated to question embeddings (as in the main manuscript). The cells are colored by quadrant ($\mathcal{VV}$: red, $\mathcal{VT}$: yellow, $\mathcal{TV}$: blue, $\mathcal{TT}$: green).
  • Figure 5: Visualization of the first attention head, as a heatmap, from the second layer of the JustAsk model with $l_\mathcal{V} = 20$ and $l_\mathcal{T} = 20$. Note that here the text embeddings are pre-concatenated to the video embeddings in the input. The video and text token sequences have lengths 9 and 7, respectively, and each is individually padded to length 20. We visualize (a) the original attention values and (b)-(d) the values after short-circuiting (SC) operations.
  • ...and 3 more figures
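
The row-wise average-and-replace operation in the Figure 1 and Figure 4 captions can be illustrated with a short NumPy sketch. It assumes that video tokens come first in the sequence (as in Figure 4), that the first letter of a quadrant name indexes rows and the second indexes columns, and that there is no padding; the function names `row_average_replace` and `short_circuit` are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def row_average_replace(attn, quadrant, len_v, len_t):
    """Replace one quadrant of `attn` with its row-wise mean.

    Assumed layout: rows/columns [0, len_v) are video tokens and
    [len_v, len_v + len_t) are text tokens; `quadrant` is one of
    "VV", "VT", "TV", "TT" (row modality, then column modality).
    """
    total = len_v + len_t
    if attn.shape != (total, total):
        raise ValueError("attn must be (len_v + len_t) x (len_v + len_t)")
    spans = {"V": slice(0, len_v), "T": slice(len_v, total)}
    rows, cols = spans[quadrant[0]], spans[quadrant[1]]
    out = attn.astype(float, copy=True)
    # Each row of the block is overwritten by that row's mean over the block.
    out[rows, cols] = out[rows, cols].mean(axis=1, keepdims=True)
    return out


def short_circuit(attn, quadrants, len_v, len_t):
    """Apply the row-wise averaging to each listed quadrant in turn."""
    for q in quadrants:
        attn = row_average_replace(attn, q, len_v, len_t)
    return attn


# Toy attention matrix with L_V = 3 and L_T = 2, rows normalized to sum to 1.
rng = np.random.default_rng(0)
A = rng.random((5, 5))
A /= A.sum(axis=1, keepdims=True)
A_sc = short_circuit(A, ["TT", "VV"], len_v=3, len_t=2)
print(np.round(A_sc, 3))
```

Averaging within a row leaves that row's sum unchanged, so a row-stochastic attention matrix stays row-stochastic after the operation; only the token-level distinctions inside the averaged quadrants are lost.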

Theorems & Definitions (4)

  • Theorem 2.1
  • proof
  • Lemma 2.2
  • proof