SVLTA: Benchmarking Vision-Language Temporal Alignment via Synthetic Video Situation
Hao Du, Bo Wu, Yan Lu, Zhendong Mao
TL;DR
This paper addresses the challenge of evaluating vision-language temporal alignment by revealing biases in existing benchmarks and proposing SVLTA, a synthetic, controllable benchmark built in VirtualHome with 96 compositional actions, 25.3K video situations, and 77.1K temporal annotations. It introduces the Temporal Jensen–Shannon Divergence to quantify temporal distribution biases and a pipeline including synthetic situation generation, language generation, and inequality-constrained global filtering to ensure balanced data. Through experiments on temporal question answering, distributional-shift sensitivity, and adaptation, the work reveals substantial gaps in current VidLLMs' temporal alignment capabilities and demonstrates that transformer-based architectures offer stronger transfer to new situations. The findings provide a principled diagnostic framework and practical guidance for developing and evaluating temporally aware vision-language models with fair, unbiased benchmarks.
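As a rough illustration of how a Jensen–Shannon-style measure could quantify temporal distribution bias, the sketch below compares a histogram of normalized moment midpoints against a uniform reference. This is not the paper's definition of Temporal Jensen–Shannon Divergence, which this summary does not spell out; the function names, the midpoint binning, the bin count, and the uniform reference are all assumptions chosen for illustration.

```python
import math

def temporal_histogram(moments, bins=10):
    """Histogram of normalized moment midpoints (hypothetical helper).

    moments: iterable of (start, end, video_duration) in seconds.
    Returns a list of `bins` probabilities that sum to 1.
    """
    counts = [0] * bins
    for start, end, duration in moments:
        mid = (start + end) / 2.0 / duration      # midpoint normalized to [0, 1]
        idx = min(int(mid * bins), bins - 1)      # clamp mid == 1.0 into last bin
        counts[idx] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no moments given")
    return [c / total for c in counts]

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Standard Jensen-Shannon divergence in bits, bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Assumed reference: a perfectly balanced (uniform) temporal distribution.
uniform = [0.1] * 10

# Moments spread evenly over the video versus moments clustered at the start.
balanced = [(i, i + 1, 10) for i in range(10)]
biased = [(0, 1, 10)] * 10

balanced_score = js_divergence(temporal_histogram(balanced), uniform)
biased_score = js_divergence(temporal_histogram(biased), uniform)
```

Under these assumptions, a lower score indicates that annotated moments are spread more evenly across the video timeline, while a benchmark whose moments cluster at one position scores higher.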
Abstract
Vision-language temporal alignment is a crucial capability for human dynamic recognition and cognition in real-world scenarios. While existing research focuses on capturing vision-language relevance, it faces limitations due to biased temporal distributions, imprecise annotations, and insufficient compositionality. To achieve fair evaluation and comprehensive exploration, our objective is to investigate and evaluate the ability of models to achieve alignment from a temporal perspective, specifically focusing on their capacity to synchronize visual scenarios with linguistic context in a temporally coherent manner. As a preliminary step, we present a statistical analysis of existing benchmarks and reveal their challenges from a decomposed perspective. To this end, we introduce SVLTA, the Synthetic Vision-Language Temporal Alignment benchmark, derived via a well-designed and feasible controllable generation method within a simulation environment. The approach considers commonsense knowledge, manipulable actions, and constrained filtering, and generates reasonable, diverse, and balanced data distributions for diagnostic evaluations. Our experiments reveal diagnostic insights through evaluations of temporal question answering, distributional shift sensitivity, and temporal alignment adaptation.
