ACT-Bench: Towards Action Controllable World Models for Autonomous Driving

Hidehisa Arai, Keishi Ishihara, Tsubasa Takahashi, Yu Yamaguchi

TL;DR

ACT-Bench addresses the limited reproducibility of action-fidelity evaluation in driving world models. It provides an open benchmark that pairs short-context nuScenes videos with ground-truth trajectories, together with an automated motion estimator, ACT-Estimator, that quantifies instruction adherence. The framework introduces the IEC and TA metrics to measure how well generated scenes follow high-level actions and trajectories, enabling systematic, reproducible evaluation. Empirical results show that the state-of-the-art Vista model does not fully follow given instructions, while Terra, a baseline world model trained on large trajectory-annotated datasets, achieves improved action fidelity and can generate diverse, instruction-conditioned scenes, including crash scenarios. By releasing all components publicly, ACT-Bench aims to catalyze further research toward reliable, action-controllable world models for autonomous driving.

Abstract

World models have emerged as promising neural simulators for autonomous driving, with the potential to supplement scarce real-world data and enable closed-loop evaluations. However, current research primarily evaluates these models based on visual realism or downstream task performance, with limited focus on fidelity to specific action instructions - a crucial property for generating targeted simulation scenes. Although some studies address action fidelity, their evaluations rely on closed-source mechanisms, limiting reproducibility. To address this gap, we develop an open-access evaluation framework, ACT-Bench, for quantifying action fidelity, along with a baseline world model, Terra. Our benchmarking framework includes a large-scale dataset pairing short context videos from nuScenes with corresponding future trajectory data, which provides conditional input for generating future video frames and enables evaluation of action fidelity for executed motions. Furthermore, Terra is trained on multiple large-scale trajectory-annotated datasets to enhance action fidelity. Leveraging this framework, we demonstrate that the state-of-the-art model does not fully adhere to given instructions, while Terra achieves improved action fidelity. All components of our benchmark framework will be made publicly available to support future research.

Paper Structure

This paper contains 33 sections, 7 equations, 12 figures, 8 tables.

Figures (12)

  • Figure 1: ACT-Bench: Action Controllability Test Benchmark. This benchmark suite evaluates action fidelity in driving world models by utilizing a unique dataset consisting of prior frames paired with ground-truth future trajectories. It enables systematic assessment of action fidelity and comparison with the novel world model, Terra.
  • Figure 2: ACT-Bench assesses the action controllability of world models by estimating actions, trajectories, and their deviations from the generated driving scenes using our motion estimator, ACT-Estimator. In the upper example, Terra successfully follows the "curving to left" instruction. In contrast, the lower example illustrates that Vista fails to follow the instruction. This evaluation helps to identify cases where driving world models do not adhere to the given action instructions, and to compare performance across models.
  • Figure 3: Terra's architecture overview. Terra follows the same design philosophy as GAIA-1 (Hu et al., 2023) but omits the text conditioning capability and the use of a video decoder to maintain simplicity.
  • Figure 4: Prediction accuracy of the ACT-Estimator across high-level actions on the validation dataset, visualized as a matrix. Diagonal values indicate correct predictions, while off-diagonal values represent mismatches. The overall accuracy is 94.03%.
  • Figure 5: Examples of Estimated Vehicle Trajectories. DROID-SLAM (DS) tends to overestimate the trajectory length, especially in curving and straight high-speed scenarios. Our model demonstrates higher alignment with the ground truth (GT) trajectory.
  • ...and 7 more figures
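
The caption of Figure 4 describes a matrix whose diagonal holds correct predictions and whose off-diagonal cells hold mismatches. As a minimal sketch of how an overall accuracy figure can be read off such a matrix, the Python below sums the diagonal and divides by the total. It assumes the matrix stores raw counts (the figure itself may be row-normalized), and the function names, class count, and toy numbers are all hypothetical; they are not taken from the paper.

```python
from typing import List, Sequence


def overall_accuracy(confusion: Sequence[Sequence[int]]) -> float:
    """Share of samples on the diagonal of a square count matrix."""
    n = len(confusion)
    if n == 0 or any(len(row) != n for row in confusion):
        raise ValueError("confusion matrix must be square and non-empty")
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix contains no samples")
    correct = sum(confusion[i][i] for i in range(n))
    return correct / total


def per_class_recall(confusion: Sequence[Sequence[int]]) -> List[float]:
    """Diagonal count divided by the row sum (rows = true class)."""
    recalls = []
    for i, row in enumerate(confusion):
        row_total = sum(row)
        recalls.append(row[i] / row_total if row_total else 0.0)
    return recalls


# Hypothetical counts for three illustrative action classes.
# Rows are the true action, columns the predicted action.
toy_confusion = [
    [48, 1, 1],
    [2, 45, 3],
    [0, 2, 48],
]

if __name__ == "__main__":
    print(f"overall accuracy: {overall_accuracy(toy_confusion):.4f}")
    print("per-class recall:", per_class_recall(toy_confusion))
```

Per-class recall is included because a single overall number can hide a weak class; the off-diagonal cells in a row show which actions that class is confused with.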