VACT: A Video Automatic Causal Testing System and a Benchmark

Haotong Yang, Qingyuan Zheng, Yunjian Gao, Yongkun Yang, Yangbo He, Zhouchen Lin, Muhan Zhang

TL;DR

The paper tackles the problem of factual and physical inconsistencies in text-conditioned video generation by introducing VACT, a fully automated causal-testing framework. It combines LLM-driven automatic generation of scenario-specific causal graphs and Boolean rules with an intervention-based testing pipeline that uses video generation and vision-language reasoning to assess causal understanding at three progressively harder levels. The authors establish a multi-level metric suite (text, generation, and rule consistency) and create a scalable benchmark by evaluating a broad set of VGMs, revealing pervasive gaps in causal learning and suggesting directions for data augmentation and reinforcement learning alignment. The framework is validated through crowd experiments and human baselines, demonstrating strong alignment with human reasoning and highlighting the need for further improvements to realize reliable world-simulation capabilities in VGMs. Overall, VACT provides a scalable, automated path toward diagnosing and mitigating causal understanding deficiencies in VGMs, contributing to their reliability and real-world applicability.
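To make the idea of a causal system with Boolean rules and intervention-based test cases concrete, here is a minimal Python sketch. It is not the authors' implementation: the node names (`object_is_heavy`, `object_enters_water`, `noticeable_splash`), the single rule, and the `agreement` score are illustrative assumptions loosely based on the swimming-pool example, and the score is a plain match fraction rather than any of the paper's metrics.

```python
from itertools import product
from typing import Callable, Dict, List

# Hypothetical root nodes (assumed for illustration, not taken from the paper's graph).
ROOTS: List[str] = ["object_is_heavy", "object_enters_water"]

# Hypothetical Boolean rules for non-root nodes, listed in topological order.
# Each rule reads the values of nodes that have already been computed.
RULES: Dict[str, Callable[[Dict[str, bool]], bool]] = {
    "noticeable_splash": lambda v: v["object_is_heavy"] and v["object_enters_water"],
}


def evaluate(root_values: Dict[str, bool]) -> Dict[str, bool]:
    """Propagate root values through the rules to get every node's expected value."""
    values = dict(root_values)
    for node, rule in RULES.items():
        values[node] = rule(values)
    return values


def intervention_cases() -> List[Dict[str, bool]]:
    """Enumerate every assignment of the root nodes (one intervention per assignment)."""
    return [
        evaluate(dict(zip(ROOTS, combo)))
        for combo in product([False, True], repeat=len(ROOTS))
    ]


def agreement(observed: List[Dict[str, bool]], expected: List[Dict[str, bool]]) -> float:
    """Fraction of non-root node values where the observed answer matches the rule."""
    matches = sum(o[n] == e[n] for o, e in zip(observed, expected) for n in RULES)
    return matches / (len(expected) * len(RULES))


if __name__ == "__main__":
    expected = intervention_cases()
    # A hypothetical model that shows a splash regardless of the intervention.
    always_splash = [{**case, "noticeable_splash": True} for case in expected]
    print(agreement(always_splash, expected))
```

In this toy setup, a model that renders a splash for both the stone and the feather agrees with the rule only in the cases where a splash is actually expected, which is the kind of failure the intervention-based comparison is meant to expose.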

Abstract

With the rapid advancement of text-conditioned Video Generation Models (VGMs), the quality of generated videos has significantly improved, bringing these models closer to functioning as "*world simulators*" and making real-world-level video generation more accessible and cost-effective. However, the generated videos often contain factual inaccuracies and lack understanding of fundamental physical laws. While some previous studies have highlighted this issue in limited domains through manual analysis, a comprehensive solution has not yet been established, primarily due to the absence of a generalized, automated approach for modeling and assessing the causal reasoning of these models across diverse scenarios. To address this gap, we propose VACT: an **automated** framework for modeling, evaluating, and measuring the causal understanding of VGMs in real-world scenarios. By combining causal analysis techniques with a carefully designed large language model assistant, our system can assess the causal behavior of models in various contexts without human annotation, which offers strong generalization and scalability. Additionally, we introduce multi-level causal evaluation metrics to provide a detailed analysis of the causal performance of VGMs. As a demonstration, we use our framework to benchmark several prevailing VGMs, offering insight into their causal reasoning capabilities. Our work lays the foundation for systematically addressing the causal understanding deficiencies in VGMs and contributes to advancing their reliability and real-world applicability.

Paper Structure

This paper contains 53 sections, 18 equations, 19 figures, 19 tables.

Figures (19)

  • Figure 1: Videos generated by OpenAI Sora, shown as frames. The text prompt for the upper video is "a stone is thrown into a swimming pool"; for the lower video it is "a feather is thrown into a swimming pool". Both generations show noticeable splashes, which is correct for the upper (stone) scene but incorrect for the lower (feather) scene.
  • Figure 2: An example causal graph and system: "throwing something into a swimming pool". Blue denotes root nodes and orange denotes non-root nodes. A physical explanation can be found in [dropimpact], Figure 6.
  • Figure 3: Pipeline of VACT. The pipeline consists of four main parts: causal system (i.e., test case) proposal (yellow), text prompt generation (green), video generation (blue), and answer retrieval and evaluation (pink). The pipeline receives a sentence describing a scenario as input and automatically evaluates video generation models without any human supervision or annotation.
  • Figure 4: Videos generated by (a) OpenAI Sora, (b) CogVideoX-2 and (c) Gen-3 Alpha, shown as frames. For each model, the text prompt for the upper video is "a stone is thrown into a swimming pool"; for the lower video it is "a feather is thrown into a swimming pool". Both generations show noticeable splashes, which is correct for the upper (stone) scene but incorrect for the lower (feather) scene.
  • Figure 5: Violin plot of the detailed distributions for the 5 scorers. The width indicates the number of samples. The x-axis represents the 5 annotators.
  • ...and 14 more figures

Theorems & Definitions (1)

  • Definition 1: Causal graph and system [causality2009pearl]