CycliST: A Video Language Model Benchmark for Reasoning on Cyclical State Transitions

Simon Kohaut, Daniel Ochs, Shun Zhang, Benedict Flade, Julian Eggert, Kristian Kersting, Devendra Singh Dhami

TL;DR

CycliST introduces a synthetic, high-resolution video benchmark focused on cyclical state transitions to probe temporal reasoning in video-language models. It defines a formal framework with time-dynamic objects, motion and attribute cycles, and scene-wide light cycles, rendered via Blender Cycles and annotated with ground-truth trajectories. The dataset uses template-based question generation with temporal quantifiers and two QA categories to assess both local temporal reasoning and global scene understanding. Experimental results show that current state-of-the-art VLMs struggle with cyclic dynamics, particularly in counting, orbit comprehension, and frame-level timing, highlighting a critical gap and setting the stage for future research into robust and efficient temporal reasoning in dynamic environments.

Abstract

We present CycliST, a novel benchmark dataset designed to evaluate Video Language Models (VLMs) on their ability to reason textually over cyclical state transitions. CycliST captures fundamental aspects of real-world processes by generating synthetic, richly structured video sequences featuring periodic patterns in object motion and visual attributes. CycliST employs a tiered evaluation system that progressively increases difficulty through variations in the number of cyclic objects, scene clutter, and lighting conditions, challenging state-of-the-art models on their spatio-temporal cognition. We conduct extensive experiments with current state-of-the-art VLMs, both open-source and proprietary, and reveal their limitations in generalizing to cyclical dynamics such as linear and orbital motion, as well as time-dependent changes in visual attributes like color and scale. Our results demonstrate that present-day VLMs struggle to reliably detect and exploit cyclic patterns, lack a notion of temporal understanding, and are unable to extract quantitative insights from scenes, such as the number of objects in motion, highlighting a significant technical gap that needs to be addressed. More specifically, we find that no single model consistently leads in performance: neither size nor architecture correlates strongly with outcomes, and no model succeeds equally well across all tasks. By providing a targeted challenge and a comprehensive evaluation framework, CycliST paves the way for visual reasoning models that surpass the state of the art in understanding periodic patterns.

Paper Structure

This paper contains 29 sections, 2 equations, 9 figures, 12 tables.

Figures (9)

  • Figure 1: CycliST: A diagnostic Video Question Answering and Scene Understanding benchmark for Video Language Models. As CycliST's scenes undergo periodic and smooth changes in position and visual attributes, they always return to each configuration at regular intervals.
  • Figure 2: Visualizing cyclical state transitions in space: We show the spatial relations between objects in an exemplary scene (a) over time (b). While some relations are constant, others are affected by the scene's motion cycles and exhibit a periodic pattern as well.
  • Figure 3: CycliST's question categorization model: We group CycliST's questions into two broad categories: temporal descriptive and scene representative. The former challenges VLMs not only to understand a scene, but also to determine whether an answer is always true or true only at some point in time. The latter tasks VLMs with both understanding the presented cycles themselves and extracting quantitative properties, such as the number of cycles or their periodicity.
  • Figure 4: CycliST's question generation pipeline: Given a question template and scene graph, CycliST samples a question instance and yields a ground-truth answer through a functional program applied to the scene graph. Here, two example question-answer pairs are derived from a CycliST scene (a), one an existential temporal query (b) and one a universal temporal relation (c).
  • Figure 5: CycliST's evaluation pipelines: By employing an LLM-judge in its VQA (a) and scene understanding (b) pipelines, VLMs can give free-form answers rather than being limited to a multiple-choice questionnaire.
  • ...and 4 more figures
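
The captions of Figures 2 and 4 describe spatial relations that vary periodically and questions quantified over time (existential versus universal). The following is a minimal Python sketch of that idea, not the benchmark's actual generation code: the function names (`linear_cycle`, `left_of`, `answer`), the triangle-wave motion model, the one-dimensional positions, and the quantifier keywords "sometimes" and "always" are all assumptions made for illustration.

```python
def linear_cycle(start, end, period, t):
    """Hypothetical linear motion cycle: the object moves from `start`
    to `end` and back, returning to `start` every `period` frames."""
    phase = (t % period) / period
    # Triangle wave that goes 0 -> 1 -> 0 over one period.
    weight = 1.0 - abs(2.0 * phase - 1.0)
    return start + (end - start) * weight


def left_of(x_a, x_b):
    """Illustrative spatial relation on 1D positions."""
    return x_a < x_b


def answer(relation, track_a, track_b, quantifier):
    """Evaluate a relation per frame, then aggregate over time."""
    per_frame = [relation(a, b) for a, b in zip(track_a, track_b)]
    if quantifier == "sometimes":  # existential temporal query
        return any(per_frame)
    if quantifier == "always":  # universal temporal query
        return all(per_frame)
    raise ValueError(f"unknown quantifier: {quantifier}")


frames = range(16)
period = 8

# Object A oscillates between x = -2 and x = 2; B and C are static.
track_a = [linear_cycle(-2.0, 2.0, period, t) for t in frames]
track_b = [0.0 for _ in frames]
track_c = [3.0 for _ in frames]

a_sometimes_left_of_b = answer(left_of, track_a, track_b, "sometimes")
a_always_left_of_b = answer(left_of, track_a, track_b, "always")
a_always_left_of_c = answer(left_of, track_a, track_c, "always")
```

In this toy scene, A is left of B only during part of each cycle, so the existential query holds while the universal one does not; A never reaches C, so that relation is constant over time.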