SVLTA: Benchmarking Vision-Language Temporal Alignment via Synthetic Video Situation

Hao Du, Bo Wu, Yan Lu, Zhendong Mao

TL;DR

This paper addresses the challenge of evaluating vision-language temporal alignment by revealing biases in existing benchmarks and proposing SVLTA, a synthetic, controllable benchmark built in VirtualHome with 96 compositional actions, 25.3K video situations, and 77.1K temporal annotations. It introduces the Temporal Jensen–Shannon Divergence to quantify temporal distribution biases and a pipeline including synthetic situation generation, language generation, and inequality-constrained global filtering to ensure balanced data. Through experiments on temporal question answering, distributional-shift sensitivity, and adaptation, the work reveals substantial gaps in current VidLLMs' temporal alignment capabilities and demonstrates that transformer-based architectures offer stronger transfer to new situations. The findings provide a principled diagnostic framework and practical guidance for developing and evaluating temporally aware vision-language models with fair, unbiased benchmarks.
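To make the idea of measuring temporal distribution bias concrete, here is a minimal sketch. It is not the paper's Temporal Jensen–Shannon Divergence, whose exact definition is not reproduced in this summary. It applies the standard Jensen–Shannon divergence to a 2D histogram of normalized (start, end) timestamps and compares it with a uniform reference. The function name `temporal_jsd`, the bin count, and the choice of a uniform reference over cells with start bin <= end bin are all assumptions made for illustration.

```python
import numpy as np


def temporal_jsd(starts, ends, bins=10):
    """Jensen-Shannon divergence (base 2, so in [0, 1]) between the empirical
    distribution of normalized (start, end) pairs and a uniform reference.

    Illustrative only: the uniform reference and the binning are assumptions,
    not the paper's definition.
    """
    starts = np.asarray(starts, dtype=float)
    ends = np.asarray(ends, dtype=float)
    edges = np.linspace(0.0, 1.0, bins + 1)

    # hist[i, j] counts samples whose start falls in bin i and end in bin j.
    hist, _, _ = np.histogram2d(starts, ends, bins=[edges, edges])

    # Only cells with start bin <= end bin can hold a valid moment.
    valid = np.triu(np.ones((bins, bins), dtype=bool))
    p = hist[valid]
    p = p / p.sum()                      # empirical distribution
    q = np.full(p.shape, 1.0 / p.size)   # uniform reference
    m = 0.5 * (p + q)                    # mixture distribution

    def kl(a, b):
        # KL divergence in bits; terms with a == 0 contribute nothing.
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# A biased set: every moment starts and ends near the beginning of the video.
biased = temporal_jsd([0.05] * 100, [0.15] * 100)

# A balanced set: one sample at the center of every valid (start, end) cell.
centers = (np.arange(10) + 0.5) / 10
pairs = [(s, e) for i, s in enumerate(centers) for e in centers[i:]]
balanced = temporal_jsd([s for s, _ in pairs], [e for _, e in pairs])
```

Under these assumptions, a value near 0 indicates that moments are spread evenly over the valid start/end grid, while a value closer to 1 indicates that they are concentrated in a few cells.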

Abstract

Vision-language temporal alignment is a crucial capability for human dynamic recognition and cognition in real-world scenarios. While existing research focuses on capturing vision-language relevance, it faces limitations due to biased temporal distributions, imprecise annotations, and insufficient compositionality. To achieve fair evaluation and comprehensive exploration, our objective is to investigate and evaluate the ability of models to achieve alignment from a temporal perspective, specifically focusing on their capacity to synchronize visual scenarios with linguistic context in a temporally coherent manner. As a preliminary step, we present a statistical analysis of existing benchmarks and reveal their challenges from a decomposed perspective. To this end, we introduce SVLTA, a Synthetic Vision-Language Temporal Alignment benchmark derived via a well-designed and feasible controllable generation method within a simulation environment. The approach considers commonsense knowledge, manipulable actions, and constrained filtering, and generates reasonable, diverse, and balanced data distributions for diagnostic evaluations. Our experiments reveal diagnostic insights through evaluations in temporal question answering, distributional shift sensitivity, and temporal alignment adaptation.

Paper Structure

This paper contains 20 sections, 1 equation, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Overview of the SVLTA benchmark, which consists of synthetic videos, language, and high-quality temporal alignment.
  • Figure 2: Multiple Levels of Temporal Distributions. We sample decomposed semantic constituents in Charades-STA. Darker colors indicate higher sample density. The horizontal and vertical axes represent the normalized start and end time points, respectively.
  • Figure 3: Overview of the benchmark generation process, which contains five stages. (a) Situation Component Initialization defines a series of compositional elements, including diverse actions, agents, and situations. (b) Commonsense Activity Graph builds a graph from activity commonsense and then uses a traversal algorithm and re-weighted sampling to acquire varied and meaningful logical action chains. (c) Controllable Activity Manuscript manipulates the actions in the logical action chains through different framerates and permutations to obtain the final activity manuscript, thereby balancing the temporal distribution. (d) Synthetic Video and Language Sentence Generation converts the generated activity manuscript into functional programs and uses them to generate synthetic videos and sentences. (e) Visual-Language Temporal Alignment automatically associates the timestamps with the actions in the sentences to obtain high-quality annotations.
  • Figure 4: Temporal distributions of beginning or ending times.
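
The graph traversal with re-weighted sampling mentioned in Figure 3(b) can be illustrated with a small sketch. The paper's actual graph, traversal algorithm, and weighting scheme are not given in this summary, so everything below is hypothetical: the toy `GRAPH`, the function name `sample_chain`, and the inverse-frequency weights are assumptions chosen only to show how re-weighting can steer sampling toward under-used actions.

```python
import random

# Hypothetical commonsense graph: each action maps to the actions that may
# plausibly follow it. This is a toy example, not the paper's action set.
GRAPH = {
    "walk to fridge": ["open fridge"],
    "open fridge": ["grab milk", "close fridge"],
    "grab milk": ["close fridge", "drink milk"],
    "close fridge": ["walk to sofa"],
    "drink milk": ["walk to sofa"],
    "walk to sofa": ["sit on sofa"],
    "sit on sofa": [],
}


def sample_chain(graph, start, max_len, counts, rng):
    """Random walk over the graph that favors actions used less often so far.

    `counts` tracks how many times each action has been sampled across calls,
    so repeated calls gradually balance action frequencies.
    """
    chain = [start]
    counts[start] = counts.get(start, 0) + 1
    while len(chain) < max_len:
        # Successors of the last action that are not already in the chain.
        candidates = [a for a in graph[chain[-1]] if a not in chain]
        if not candidates:
            break
        # Assumed re-weighting: inverse of (1 + usage count).
        weights = [1.0 / (1 + counts.get(a, 0)) for a in candidates]
        nxt = rng.choices(candidates, weights=weights, k=1)[0]
        chain.append(nxt)
        counts[nxt] = counts.get(nxt, 0) + 1
    return chain


rng = random.Random(0)
usage = {}
chains = [sample_chain(GRAPH, "walk to fridge", 5, usage, rng) for _ in range(20)]
```

Each sampled chain follows only edges present in the graph, so the resulting action sequences stay logically plausible while the shared `usage` counter pushes later samples toward rarely chosen branches.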