OSCBench: Benchmarking Object State Change in Text-to-Video Generation

Xianjing Han, Bin Zhu, Shiqi Hu, Franklin Mingzhe Li, Patrick Carrington, Roger Zimmermann, Jingjing Chen

Abstract

Text-to-video (T2V) generation models have made rapid progress in producing visually high-quality and temporally coherent videos. However, existing benchmarks primarily focus on perceptual quality, text-video alignment, or physical plausibility, leaving a critical aspect of action understanding largely unexplored: object state change (OSC) explicitly specified in the text prompt. OSC refers to the transformation of an object's state induced by an action, such as peeling a potato or slicing a lemon. In this paper, we introduce OSCBench, a benchmark specifically designed to assess OSC performance in T2V models. OSCBench is constructed from instructional cooking data and systematically organizes action-object interactions into regular, novel, and compositional scenarios to probe both in-distribution performance and generalization. We evaluate six representative open-source and proprietary T2V models using both a human user study and multimodal large language model (MLLM)-based automatic evaluation. Our results show that, despite strong performance on semantic and scene alignment, current T2V models consistently struggle with accurate and temporally consistent object state changes, especially in novel and compositional settings. These findings position OSC as a key bottleneck in text-to-video generation and establish OSCBench as a diagnostic benchmark for advancing state-aware video generation models.

Paper Structure (21 sections, 12 figures, 9 tables)

Figures (12)

  • Figure 1: Overview of OSCBench evaluation. (a) Representative failure cases from regular, novel, and compositional object state change scenarios. In the regular case, the red box marks an implausible state change of the lemon during slicing. In the novel case, the model misinterprets the instructed action, resulting in a wrong object transformation. In the compositional case, the yellow box indicates an incomplete state change where the pear remains unpeeled. (b) Human-evaluated multi-dimensional performance of T2V models on OSCBench.
  • Figure 2: Overview of the OSCBench construction and evaluation pipeline. We build unified action and object categories from instructional cooking data via a human-in-the-loop process, and construct regular, novel, and compositional OSC scenarios as text prompts for video generation. The generated videos are evaluated by humans and MLLMs across multiple criteria, and we analyze their correlations to assess automatic evaluation reliability.
  • Figure 3: Overall performance comparison of T2V models based on aggregated evaluation scores from human evaluators and MLLM-based evaluators (Qwen3-VL-30B and GPT-5.2).
  • Figure 4: Sampled video frames generated by different T2V models. State change consistency or noticeable artifacts are highlighted in boxes.
  • Figure 5: Object state change performance across action categories by human evaluation. Scores are averaged over accuracy and consistency.
  • ...and 7 more figures