Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos

Songtao Jiang, Sibo Song, Chenyi Zhou, Yuan Wang, Ruizhe Chen, Tongkun Guan, Ruilin Luo, Yan Zhang, Zhihang Tang, Yuchong Sun, Hang Zhang, Zhibo Yang, Shuai Bai, Junyang Lin, Zuozhu Liu

Abstract

The transition from image to video understanding requires vision-language models (VLMs) to shift from recognizing static patterns to reasoning over temporal dynamics such as motion trajectories, speed changes, and state transitions. Yet current post-training methods fall short due to two critical limitations: (1) existing datasets often lack temporal-centricity, in that answers can be inferred from isolated keyframes rather than requiring holistic temporal integration; and (2) training data generated by proprietary models contains systematic errors in fundamental temporal perception, such as confusing motion directions or misjudging speeds. We introduce SynRL, a post-training framework that teaches models temporal primitives, the fundamental building blocks of temporal understanding, including direction, speed, and state tracking. Our key insight is that these abstract primitives, learned from programmatically generated synthetic videos, transfer effectively to real-world scenarios. We decompose temporal understanding into short-term perceptual primitives (speed, direction) and long-term cognitive primitives, constructing 7.7K CoT and 7K RL samples with ground-truth frame-level annotations through code-based video generation. Despite training on simple geometric shapes, SynRL achieves substantial improvements across 15 benchmarks spanning temporal grounding, complex reasoning, and general video understanding. Remarkably, our 7.7K synthetic CoT samples outperform Video-R1, which uses 165K real-world samples. We attribute this to fundamental temporal skills, such as tracking frame-by-frame changes and comparing velocities, that transfer effectively from abstract synthetic patterns to complex real-world scenarios. This establishes a new paradigm for video post-training: temporal learning through carefully designed synthetic data provides a more cost-efficient scaling path.
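To make the idea of code-based generation with ground-truth frame-level annotations concrete, the following is a minimal sketch under stated assumptions, not the authors' pipeline. It records the position of a single shape moving at constant velocity for every frame, then instantiates a direction question whose answer is derived from that metadata. The names `FrameRecord`, `generate_clip`, and `make_direction_qa`, the canvas coordinates, and the question wording are all hypothetical; only the 30 FPS rate comes from the paper's Figure 2 caption. Rendering the frames to pixels is omitted.

```python
from dataclasses import dataclass
import random

FPS = 30  # rendering rate stated in the Figure 2 caption

# Hypothetical direction vocabulary. Image coordinates are assumed, so +y points down.
DIRECTIONS = {"right": (1, 0), "left": (-1, 0), "down": (0, 1), "up": (0, -1)}


@dataclass
class FrameRecord:
    """Ground-truth metadata for one frame (illustrative schema)."""
    index: int
    time_s: float
    x: float
    y: float


def generate_clip(direction, speed_px_s, duration_s, start=(320.0, 240.0)):
    """Return per-frame positions of a shape moving at constant velocity."""
    dx, dy = DIRECTIONS[direction]
    n_frames = int(round(duration_s * FPS))
    frames = []
    for i in range(n_frames):
        t = i / FPS
        frames.append(
            FrameRecord(
                index=i,
                time_s=t,
                x=start[0] + dx * speed_px_s * t,
                y=start[1] + dy * speed_px_s * t,
            )
        )
    return frames


def make_direction_qa(frames, shape="square", rng=None):
    """Instantiate a templated multiple-choice question from the metadata."""
    rng = rng or random.Random(0)
    dx = frames[-1].x - frames[0].x
    dy = frames[-1].y - frames[0].y
    # The answer is computed from the recorded trajectory, not annotated by a model.
    if abs(dx) >= abs(dy):
        answer = "right" if dx > 0 else "left"
    else:
        answer = "down" if dy > 0 else "up"
    options = list(DIRECTIONS)
    rng.shuffle(options)
    return {
        "question": f"In which direction does the {shape} move?",
        "options": options,
        "answer": answer,
    }
```

Because the answer is computed from the same metadata that would drive the renderer, the label cannot disagree with the video, which is the property the abstract contrasts with model-generated annotations.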

Paper Structure

This paper contains 17 sections, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Systematic failures in proprietary models' temporal perception. (a) Gemini-2.5-Pro incorrectly describes a simple geometric shape's motion trajectory. (b) When used to annotate real-world videos, such flawed descriptions propagate errors into training data. (Erroneous phrases in red.)
  • Figure 2: Overview of the SynRL framework. (Left) Programmatic Generation Pipeline: Object properties and motion dynamics are specified in Python code to generate videos with accurate frame-level metadata. Videos are rendered at 30 FPS, and QA pairs are instantiated from hand-crafted templates conditioned on the same metadata, yielding temporally grounded synthetic video-QA triples. (Middle) Chain-of-Thought Augmentation: Given a synthetic video, its metadata, and its QA pair, a multimodal LLM generates step-by-step reasoning chains. A Judger verifies their consistency with the event timeline and filters out incorrect outputs. The verified reasoning chains are then passed to a Polisher, which refines them into more natural and fluent CoT annotations while preserving factual correctness. (Right) Training Strategy: The target model is first trained via supervised fine-tuning on the polished CoT data to learn explicit temporal reasoning patterns. It is then optimized with group relative policy optimization (GRPO) on synthetic video-QA samples with verifiable rewards, so that reinforcement learning further improves reasoning quality under strictly correct supervisory signals.
  • Figure 3: (a) Training Examples: We generate diverse synthetic videos spanning 8 major categories with 18 subcategories, covering short-term perceptual primitives (speed perception, motion tracking, direction identification) and long-term cognitive primitives (grid-based object tracking, symbol manipulation, code execution, mathematical operations, container management). Each video is procedurally generated with verifiable ground-truth answers. (b) Constructing CoT using Meta Information: Using a grid-based object tracking game as an example, we provide the video with its code-derived metadata (initial state, swap events, timestamps) to a VLM. The model generates step-by-step reasoning chains that track each object through every swap operation, explicitly referencing frame timestamps (e.g., "at 00:00", "at 00:02") and logging state transitions. This metadata-conditioned generation ensures temporal grounding and verifiable correctness, as the reasoning must align with the documented event sequence.
  • Figure 4: mIoP performance comparison on the NExTGQA and RexTime benchmarks.
  • Figure 5: Real-world motion understanding examples. The base model fails on direction recognition and speed perception, while the SynRL-trained model answers both correctly, indicating that temporal skills acquired from synthetic videos can transfer to real-world video understanding.
  • ...and 1 more figure
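The captions for Figures 2 and 3 describe two mechanisms that can be sketched briefly: replaying code-derived swap events to obtain a verifiable answer for the grid-based object tracking task, and scoring sampled answers with a verifiable reward that GRPO normalizes within each group. The sketch below is illustrative only. The metadata layout (`time_s`, `cells`), the exact-match reward, and the use of the population standard deviation are assumptions, not details confirmed by the paper.

```python
import statistics


def track_swaps(initial_state, swap_events):
    """Replay swap events in time order on a cell -> object mapping.

    Returns the final state and a timestamped log of intermediate states,
    mirroring the kind of state-transition trace a CoT could reference.
    """
    state = dict(initial_state)
    log = []
    for event in sorted(swap_events, key=lambda e: e["time_s"]):
        a, b = event["cells"]
        state[a], state[b] = state[b], state[a]
        log.append((event["time_s"], dict(state)))
    return state, log


def verifiable_reward(predicted, ground_truth):
    """Exact-match reward (an assumed rule): 1.0 if correct, else 0.0."""
    same = predicted.strip().lower() == ground_truth.strip().lower()
    return 1.0 if same else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps), the
    group-relative normalization commonly used in GRPO.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

A usage example: with an initial grid `{"A1": "star", "A2": "circle", "B1": "triangle"}` and swaps of (A1, A2) at 0 s and (A2, B1) at 2 s, `track_swaps` places the star in cell B1. Sampled answers are then scored against "B1", and responses that beat the group average receive positive advantages.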