Large Language Models are Temporal and Causal Reasoners for Video Question Answering

Dohwan Ko; Ji Soo Lee; Wooyoung Kang; Byungseok Roh; Hyunwoo J. Kim

TL;DR

VideoQA requires temporal and causal reasoning, but strong LLM priors can lead to ungrounded, linguistically biased answers. The authors introduce Flipped-VQA, a training framework that uses three objectives (VQ -> A, VA -> Q, QA -> V) to coax decoder-only LLMs to reason about video content and its relation to language, implemented in the LLaMA-VQA architecture with a lightweight adapter and a CLIP visual encoder. Flipped-VQA yields consistent performance gains across five challenging benchmarks and multiple LLMs, while also mitigating linguistic bias by grounding answers in visual information. The method is parameter-efficient and generalizable to other LLMs, and the code is publicly available.

Abstract

Large Language Models (LLMs) have shown remarkable performance on a wide range of natural language understanding and generation tasks. We observe that LLMs provide effective priors in exploiting $\textit{linguistic shortcuts}$ for temporal and causal reasoning in Video Question Answering (VideoQA). However, such priors often cause suboptimal results on VideoQA by leading the model to over-rely on questions, $\textit{i.e.}$, $\textit{linguistic bias}$, while ignoring visual content. This is also known as "ungrounded guesses" or "hallucinations". To address this problem while leveraging LLMs' prior on VideoQA, we propose a novel framework, Flipped-VQA, which encourages the model to predict all the combinations of the $\langle$V, Q, A$\rangle$ triplet by flipping the source pair and the target label to understand their complex relationships, $\textit{i.e.}$, predict A, Q, and V given VQ, VA, and QA pairs, respectively. In this paper, we develop LLaMA-VQA by applying Flipped-VQA to LLaMA, and it outperforms both LLM-based and non-LLM-based models on five challenging VideoQA benchmarks. Furthermore, Flipped-VQA is a general framework that is applicable to various LLMs (OPT and GPT-J) and consistently improves their performance. We empirically demonstrate that Flipped-VQA not only enhances the exploitation of linguistic shortcuts but also mitigates the linguistic bias that causes incorrect answers through over-reliance on the question. Code is available at https://github.com/mlvlab/Flipped-VQA.
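
The "flipping" described in the abstract can be illustrated with a minimal sketch. It only shows how one ⟨V, Q, A⟩ triplet yields three (source pair, target) arrangements; it is not the authors' implementation. The function name `flip_triplet`, the task keys, and the use of placeholder frame identifiers in place of real visual features are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def flip_triplet(
    video: List[str], question: str, answer: str
) -> Dict[str, Tuple[tuple, object]]:
    """Arrange one <V, Q, A> triplet into three (source pair, target) tasks.

    Hypothetical helper: `video` is a list of placeholder frame identifiers,
    standing in for whatever visual features a real model would consume.
    """
    return {
        "vqa": ((video, question), answer),  # VQ -> A: the usual VideoQA task
        "vaq": ((video, answer), question),  # VA -> Q: question prediction
        "qav": ((question, answer), video),  # QA -> V: video prediction
    }


example = flip_triplet(
    ["frame_0", "frame_1"], "Why did the boy fall?", "He slipped."
)
for task, (source, target) in example.items():
    print(task, source, "->", target)
```

Each entry pairs two elements of the triplet as the source and leaves the third as the prediction target, which mirrors the VQ, VA, and QA prefixes named in the abstract.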

Paper Structure (18 sections, 17 equations, 9 figures, 10 tables)

This paper contains 18 sections, 17 equations, 9 figures, 10 tables.

Introduction
Related works
Method
LLaMA-VQA
Flipped-VQA
Experiments
Temporal and causal reasoning of LLMs
Flipped-VQA on challenging VideoQA
Flipped-VQA for mitigating linguistic bias
Conclusion
Dataset details
Implementation details
LLaMA-Adapter
NExT-QA
Discussion on Flipped-VQA
...and 3 more sections

Figures (9)

Figure 1: LLMs' temporal and causal reasoning ability. (a) An example of a causal question that LLMs correctly answer without visual content. (b) Comparison of LLaMA 33B (QA) and OPT 125M (VQA).
Figure 2: Illustration of LLMs with Flipped-VQA. Flipped-VQA consists of three objectives: $\mathcal{L}_\text{vqa}$, $\mathcal{L}_\text{vaq}$, and $\mathcal{L}_\text{qav}$. $\mathcal{L}_\text{vqa}$ is the common VideoQA objective, which predicts the answer given a video-question pair. Likewise, $\mathcal{L}_\text{vaq}$ and $\mathcal{L}_\text{qav}$ are the objectives for question and video prediction, which leverage LLMs' knowledge. In other words, for each objective, the VQ, VA, and QA pairs are used as prefix tokens to predict A, Q, and V, respectively. Trainable parameters interleaved in LLMs stand for adapter tokens as in LLaMA-Adapter. Our framework employs only a relatively small number of trainable parameters on LLMs, e.g., 4.5M trainable parameters among the total parameters of LLaMA 7B (0.06%).
Figure 3: Performances of LLMs on three question types of NExT-QA. Performances of various sizes of OPT (125M $\sim$ 6.7B) and LLaMA (7B $\sim$ 33B) are reported. A VideoQA approach with a larger language model achieves better performance in both VQA and QA settings. Surprisingly, the QA approach with LLaMA (33B) outperforms VQA models with OPT (125M $\sim$ 6.7B) in temporal and causal reasoning.
Figure 4: Examples of question generation.
Figure 5: Examples of alleviating linguistic bias.
...and 4 more figures
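
The trainable-parameter fraction quoted in the Figure 2 caption can be checked with a line of arithmetic. The sketch below assumes the nominal figures from the caption (4.5M trainable parameters, 7B total); the exact parameter count of LLaMA 7B is not given in this text, so the result is only an approximation of the reported 0.06%.

```python
# Figures taken from the Figure 2 caption; 7e9 is the nominal "7B" size,
# an assumption rather than the exact parameter count of the model.
trainable_params = 4.5e6
total_params = 7e9

fraction_pct = 100 * trainable_params / total_params
print(f"{fraction_pct:.3f}%")  # prints "0.064%", consistent with the reported 0.06%
```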
