ACT-Bench: Towards Action Controllable World Models for Autonomous Driving
Hidehisa Arai, Keishi Ishihara, Tsubasa Takahashi, Yu Yamaguchi
TL;DR
ACT-Bench addresses the limited reproducibility of action-fidelity evaluation for driving world models. It is an open benchmark that pairs short context videos from nuScenes with ground-truth future trajectories and uses an automated estimator, ACT-Estimator, to quantify how closely generated video adheres to the given instruction. The framework introduces the IEC and TA metrics to measure how well generated scenes follow high-level actions and trajectories, enabling systematic, reproducible evaluation. Empirical results show that the state-of-the-art model Vista does not fully follow the given instructions, whereas Terra, a baseline world model trained on large trajectory-annotated datasets, achieves improved action fidelity and can generate diverse, instruction-conditioned scenes, including crash scenarios. By making all components publicly available, ACT-Bench aims to catalyze further research toward reliable, action-controllable world models for autonomous driving.
Abstract
World models have emerged as promising neural simulators for autonomous driving, with the potential to supplement scarce real-world data and enable closed-loop evaluation. However, current research evaluates these models primarily on visual realism or downstream task performance and pays limited attention to fidelity to specific action instructions, a crucial property for generating targeted simulation scenes. Although some studies address action fidelity, their evaluations rely on closed-source mechanisms, which limits reproducibility. To address this gap, we develop ACT-Bench, an open-access evaluation framework for quantifying action fidelity, along with a baseline world model, Terra. The benchmark includes a large-scale dataset that pairs short context videos from nuScenes with the corresponding future trajectories. These trajectories serve as the conditioning input for generating future video frames and as the reference against which the action fidelity of the executed motion is evaluated. Terra, in turn, is trained on multiple large-scale trajectory-annotated datasets to enhance action fidelity. Using this framework, we demonstrate that the state-of-the-art model does not fully adhere to the given instructions, while Terra achieves improved action fidelity. All components of our benchmark framework will be made publicly available to support future research.
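To make the idea of quantifying action fidelity concrete, the following is a minimal sketch, not the benchmark's actual implementation. It assumes that an estimator has already recovered, from each generated video, a high-level action label and a 2D trajectory, and it compares these against the instructed values. The function names, the action labels, and the specific choice of an agreement rate and mean/final displacement errors are all illustrative assumptions; the text above does not give the exact definitions of IEC and TA.

```python
from math import hypot
from typing import Sequence, Tuple

Point = Tuple[float, float]  # (x, y) position in metres; the unit is an assumption


def instruction_agreement(instructed: Sequence[str], estimated: Sequence[str]) -> float:
    """Fraction of samples whose estimated action label equals the instructed one.

    Hypothetical stand-in for a consistency score over high-level actions.
    """
    if len(instructed) != len(estimated) or not instructed:
        raise ValueError("label sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(instructed, estimated))
    return matches / len(instructed)


def displacement_errors(target: Sequence[Point], estimated: Sequence[Point]) -> Tuple[float, float]:
    """Return (mean, final) Euclidean distance between two waypoint sequences.

    Hypothetical stand-in for a trajectory-alignment score; lower is better.
    """
    if len(target) != len(estimated) or not target:
        raise ValueError("trajectories must be non-empty and of equal length")
    dists = [hypot(tx - ex, ty - ey) for (tx, ty), (ex, ey) in zip(target, estimated)]
    return sum(dists) / len(dists), dists[-1]
```

For example, with the made-up labels `["left", "straight", "right", "stop"]` as instructions and `["left", "straight", "straight", "stop"]` as estimates, `instruction_agreement` returns 0.75. A straight instructed path `[(0, 0), (1, 0), (2, 0)]` compared with an estimated path `[(0, 0), (1, 1), (2, 2)]` gives a mean displacement of 1.0 and a final displacement of 2.0.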
