REVEAL: Relation-based Video Representation Learning for Video-Question-Answering

Sofian Chaybouti, Walid Bousselham, Moritz Wolter, Hilde Kuehne

TL;DR

REVEAL tackles VideoQA by moving beyond global video-text alignment to a structured, relation-based video representation. It converts video captions into sets of subject-predicate-object triplets, uses a Q-former to generate vision queries from video frames, and aligns these queries with text-derived relation embeddings via a Many-to-Many Noise Contrastive Estimation loss. The framework includes a slow-fast dual-pathway processing scheme and leverages Llama adapters to fuse relation embeddings with large language models for QA tasks. Across five benchmarks, REVEAL demonstrates strong temporal and relational reasoning, with ablations showing the value of relation-based supervision, the number of relations used, and the importance of initialization and dual-pathway design. The work advances practical, scalable video understanding by integrating explicitly modeled relations with open-ended language priors, enabling improved VideoQA performance on diverse content.

Abstract

Video-Question-Answering (VideoQA) requires capturing complex changes in visual relations over time. This remains a challenge even for advanced Video Language Models (VLMs), in part because the visual content must be compressed into a reasonably sized input for those models. To address this problem, we propose RElation-based Video rEpresentAtion Learning (REVEAL), a framework designed to capture visual relation information by encoding it into structured, decomposed representations. Specifically, inspired by spatiotemporal scene graphs, we propose to encode video sequences as sets of relation triplets of the form (subject-predicate-object) over time via their language embeddings. To this end, we extract explicit relations from video captions and introduce a Many-to-Many Noise Contrastive Estimation (MM-NCE) loss together with a Q-Former architecture to align an unordered set of video-derived queries with the corresponding text-based relation descriptions. At inference, the resulting Q-Former produces an efficient token representation that can serve as input to a VLM for VideoQA. We evaluate the proposed framework on five challenging benchmarks: NeXT-QA, Intent-QA, STAR, VLEP, and TVQA. The results show that the query-based video representation outperforms global alignment-based CLS or patch token representations and achieves competitive results against state-of-the-art models, particularly on tasks requiring temporal reasoning and relation comprehension. The code and models will be publicly released.
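To make the idea of aligning an unordered set of video queries with a set of relation embeddings more concrete, the following is a minimal NumPy sketch. It is not the paper's MM-NCE formulation, which is not given in this summary. It assumes a simple set-level score (for each relation, the best-matching query by cosine similarity, averaged over relations) plugged into a standard symmetric InfoNCE loss over a batch. The function names, the scoring rule, and the temperature value are all illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row of x to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def set_similarity(queries, relations):
    """Order-invariant score between a query set (Q, D) and a relation set (R, D).

    Assumption (not from the paper): each relation is matched to its most
    similar query, and the matched similarities are averaged.
    """
    sim = l2_normalize(relations) @ l2_normalize(queries).T  # shape (R, Q)
    return sim.max(axis=1).mean()


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def many_to_many_nce(video_queries, relation_sets, temperature=0.07):
    """Symmetric InfoNCE over set-level scores (hypothetical stand-in for MM-NCE).

    video_queries: list of B arrays, each (Q, D), one query set per video.
    relation_sets: list of B arrays, each (R_i, D), one relation set per caption.
    Entry i of both lists is assumed to come from the same video.
    """
    batch = len(video_queries)
    scores = np.array(
        [[set_similarity(video_queries[i], relation_sets[j]) for j in range(batch)]
         for i in range(batch)]
    ) / temperature
    diag = np.arange(batch)
    loss_video_to_text = -log_softmax(scores, axis=1)[diag, diag].mean()
    loss_text_to_video = -log_softmax(scores, axis=0)[diag, diag].mean()
    return 0.5 * (loss_video_to_text + loss_text_to_video)
```

Because the score takes a maximum over queries and a mean over relations, permuting either set leaves the loss unchanged, which matches the "unordered set" property described in the abstract. Other matching rules (for example, soft assignment) would satisfy the same property.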

Paper Structure

This paper contains 32 sections, 6 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Relation extraction pipeline: Mistral-7B decomposes WebVid-2M captions into (subject-predicate-object) triplets.
  • Figure 2: REVEAL architecture for relation-based video representation learning. The model processes videos through dual pathways: a Fast Pathway (16 frames) for global context and a Slow Pathway (4 frames) for spatial details. Key components include Vision Encoders (CLIP ViT), Temporal Encoders (transformers), Relation Q-formers, and a Relation Encoder (Sentence-RoBERTa). The training uses our MM-NCE loss to align vision queries with text-derived relation triplets.
  • Figure 3: Overview of the VideoQA finetuning approach. The framework integrates pre-trained relation embeddings from our model with LLMs via adapters.
  • Figure 4: Successful examples from STAR dataset demonstrating REVEAL's relationship alignment capabilities. Top: The model correctly identifies concurrent actions (eating sandwich while taking blanket). Bottom: The model successfully captures temporal ordering of actions (sitting at table before opening door). Alignment scores between extracted relationships and video segments are visualized, showing stronger alignment during relevant temporal windows.
  • Figure 5: Failure cases from STAR dataset highlighting REVEAL's limitations. Top: Question ambiguity leads to multiple valid interpretations of the same action sequence. Bottom: Object recognition challenge where the model defaults to common-sense assumptions about closet contents rather than recognizing the specific object (small box).
  • ...and 2 more figures
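
The Figure 2 caption describes a Fast Pathway that sees 16 frames and a Slow Pathway that sees 4 frames. The sketch below shows one plausible way to pick frame indices for the two pathways. It assumes uniform sampling at the centre of equal-length segments, which this summary does not confirm, and the function names are hypothetical.

```python
import numpy as np


def uniform_frame_indices(num_video_frames, num_samples):
    """Return num_samples frame indices spread evenly over the video.

    The video is split into num_samples equal segments and the centre of each
    segment is taken (an assumed sampling rule, not taken from the paper).
    """
    if num_video_frames <= 0 or num_samples <= 0:
        raise ValueError("num_video_frames and num_samples must be positive")
    edges = np.linspace(0, num_video_frames, num_samples + 1)
    centers = (edges[:-1] + edges[1:]) / 2
    # Clamp so that very short videos never index past the last frame.
    return np.minimum(centers.astype(int), num_video_frames - 1)


def slow_fast_indices(num_video_frames, fast_frames=16, slow_frames=4):
    """Frame indices for the two pathways, using the frame counts from Figure 2."""
    return {
        "fast": uniform_frame_indices(num_video_frames, fast_frames),
        "slow": uniform_frame_indices(num_video_frames, slow_frames),
    }
```

For a 64-frame clip, `slow_fast_indices(64)["slow"]` gives the indices 8, 24, 40, 56, while the fast pathway receives every fourth frame starting at index 2.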