FAR-Drive: Frame-AutoRegressive Video Generation in Closed-Loop Autonomous Driving

Yaoru Li, Federico Landi, Marco Godi, Xin Jin, Ruiju Fu, Yufei Ma, Muyang Sun, Heyu Si, Qi Guo

Abstract

Despite rapid progress in autonomous driving, reliable training and evaluation of driving systems remain fundamentally constrained by the lack of scalable and interactive simulation environments. Recent generative video models achieve remarkable visual fidelity, yet most operate in open-loop settings and fail to support fine-grained frame-level interaction between agent actions and environment evolution. Building a learning-based closed-loop simulator for autonomous driving poses three major challenges: maintaining long-horizon temporal and cross-view consistency, mitigating autoregressive degradation under iterative self-conditioning, and satisfying low-latency inference constraints. In this work, we propose FAR-Drive, a frame-level autoregressive video generation framework for autonomous driving. We introduce a multi-view diffusion transformer with fine-grained structured control, enabling geometrically consistent multi-camera generation. To address long-horizon consistency and iterative degradation, we design a two-stage training strategy consisting of adaptive reference horizon conditioning and blend-forcing autoregressive training, which progressively improves consistency and robustness under self-conditioning. To meet low-latency interaction requirements, we further integrate system-level efficiency optimizations for inference acceleration. Experiments on the nuScenes dataset demonstrate that our method achieves state-of-the-art performance among existing closed-loop autonomous driving simulation approaches, while maintaining sub-second latency on a single GPU.

Paper Structure (20 sections, 7 equations, 6 figures, 3 tables)

This paper contains 20 sections, 7 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Overview of the FAR-Drive framework for closed-loop autonomous driving simulation. Given driving agent actions, the simulator generates temporally and geometrically consistent multi-view video through interactive rollouts. The three core challenges of closed-loop generation are highlighted in different colors: long-horizon consistency, autoregressive degradation, and low-latency interaction. Adaptive reference-horizon conditioning, blend-forcing training, and system-level efficiency optimizations address these challenges respectively, with matching colors indicating the associated solutions.
  • Figure 2: Overview of the proposed multi-view MMDiT architecture. Reference frames and noise latents are fed into the backbone DiT, while structured controls, including projected 3D bounding boxes and BEV maps, are first encoded by a convolutional encoder and then processed by a Control DiT conditioned on the same latent inputs. All control signals are additionally aggregated into a unified scene prompt, which is injected into every layer of both the backbone DiT and the Control DiT. The outputs of each Control DiT block are injected into the corresponding backbone DiT blocks through zero-initialized projection layers (dashed arrows). Since the Control DiT has fewer layers than the backbone, control injection is applied only to the early backbone blocks.
  • Figure 3: Ablation of the proposed blend-forcing training on long-horizon generation. We evaluate FID and FVD (lower is better) at rollout lengths from 16 to 229 frames.
  • Figure 4: Qualitative comparison of long-horizon autoregressive generation before and after blend-forcing training. We generate a 229-frame six-view sequence and visualize the first frame (left) and the last frame (right). More cases are provided in the supplementary material.
  • Figure 5: Our model architecture consists of 7 backbone units (each containing 4 backbone MMDiT blocks and 1 cross-view attention block) and 3 control units. A single backbone unit and its corresponding control unit are shown in detail. In this design, each group of four backbone blocks is followed by one cross-view attention block.
  • ...and 1 more figure
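
The layout described in the captions of Figures 2 and 5 can be made concrete with a small sketch. The block counts (7 backbone units, 4 MMDiT blocks plus 1 cross-view attention block per unit, 3 control units) come from the Figure 5 caption. Everything else is an assumption made for illustration: the function names are hypothetical, the control units are assumed to pair with the first three backbone units, and the projection is a plain matrix rather than a learned layer. The sketch shows why zero-initialized projections leave the backbone unchanged at the start of training.

```python
# Illustrative sketch only: block counts follow the Figure 5 caption;
# the pairing of control units with backbone units is an assumption.
NUM_BACKBONE_UNITS = 7
MMDIT_BLOCKS_PER_UNIT = 4
NUM_CONTROL_UNITS = 3


def build_backbone_layout():
    """List (block_kind, unit_index) pairs in execution order."""
    layout = []
    for unit in range(NUM_BACKBONE_UNITS):
        layout.extend(("mmdit", unit) for _ in range(MMDIT_BLOCKS_PER_UNIT))
        layout.append(("cross_view_attention", unit))
    return layout


def controlled_units():
    """Backbone units that receive control features (assumed: the first ones)."""
    return list(range(min(NUM_CONTROL_UNITS, NUM_BACKBONE_UNITS)))


def zero_init_projection(dim):
    """A dim x dim projection matrix initialized to all zeros."""
    return [[0.0] * dim for _ in range(dim)]


def inject_control(hidden, control, weight):
    """Residual injection: hidden + weight @ control."""
    projected = [sum(w * c for w, c in zip(row, control)) for row in weight]
    return [h + p for h, p in zip(hidden, projected)]


layout = build_backbone_layout()
num_mmdit = sum(1 for kind, _ in layout if kind == "mmdit")
num_cross_view = sum(1 for kind, _ in layout if kind == "cross_view_attention")
print(num_mmdit, num_cross_view, controlled_units())

hidden = [1.0, 2.0]
control = [5.0, -3.0]
print(inject_control(hidden, control, zero_init_projection(2)))
```

Under these assumptions the backbone contains 28 MMDiT blocks and 7 cross-view attention blocks, and only units 0 to 2 receive control features. With the projection still at its zero initialization, the injected hidden state equals the original one, so the control branch begins as a no-op and only gains influence as the projection weights are trained.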