ARCON: Advancing Auto-Regressive Continuation for Driving Videos

Ruibo Ming, Jingwei Wu, Zhewei Huang, Zhuoxuan Ju, Jianming Hu, Lihui Peng, Shuchang Zhou

TL;DR

ARCON addresses long-horizon driving video continuation by leveraging an interleaved token strategy that alternates between semantic and RGB tokens within a large vision model. By using MAGVIT-v2 as a unified tokenizer and introducing a flow-based texture decoder, ARCON learns structural video information while maintaining texture fidelity, enabling minute-scale, coherent driving videos without fine-tuning on target datasets. Key findings include strong temporal consistency, high semantic-RGB correspondence, and competitive Fréchet Video Distance (FVD) results on nuScenes, with qualitative demonstrations of diverse futures and autonomous-driving knowledge. The work advances token-based video generation for world models in autonomous driving, offering a scalable path toward emergent planning and prediction capabilities in real-world driving scenarios.

Abstract

Recent advancements in auto-regressive large language models (LLMs) have led to their application in video generation. This paper explores the use of Large Vision Models (LVMs) for video continuation, a task essential for building world models and predicting future frames. We introduce ARCON, a scheme that alternates between generating semantic and RGB tokens, allowing the LVM to explicitly learn high-level structural video information. We find high consistency in the RGB images and semantic maps generated without special design. Moreover, we employ an optical flow-based texture stitching method to enhance visual quality. Experiments in autonomous driving scenarios show that our model can consistently generate long videos.
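The alternating scheme described above can be pictured as a single flat token stream in which each frame contributes one block of semantic tokens and one block of RGB tokens. The following is a minimal sketch of that interleaving, not the authors' implementation: the function names, the `semantic_first` flag, and the toy token counts are all hypothetical, and the per-frame ordering (semantic before RGB) is an assumption rather than something the abstract states.

```python
from typing import List, Sequence, Tuple


def interleave_tokens(
    semantic_frames: Sequence[Sequence[int]],
    rgb_frames: Sequence[Sequence[int]],
    semantic_first: bool = True,  # assumed ordering, not confirmed by the source
) -> List[int]:
    """Flatten per-frame token lists into one alternating sequence."""
    if len(semantic_frames) != len(rgb_frames):
        raise ValueError("expected one semantic map per RGB frame")
    sequence: List[int] = []
    for sem, rgb in zip(semantic_frames, rgb_frames):
        first, second = (sem, rgb) if semantic_first else (rgb, sem)
        sequence.extend(first)
        sequence.extend(second)
    return sequence


def split_tokens(
    sequence: Sequence[int],
    tokens_per_frame: int,
    semantic_first: bool = True,
) -> Tuple[List[List[int]], List[List[int]]]:
    """Inverse of interleave_tokens for a fixed token count per frame."""
    block = 2 * tokens_per_frame
    if tokens_per_frame <= 0 or len(sequence) % block != 0:
        raise ValueError("sequence length must be a multiple of 2 * tokens_per_frame")
    semantic_frames: List[List[int]] = []
    rgb_frames: List[List[int]] = []
    for start in range(0, len(sequence), block):
        a = list(sequence[start:start + tokens_per_frame])
        b = list(sequence[start + tokens_per_frame:start + block])
        sem, rgb = (a, b) if semantic_first else (b, a)
        semantic_frames.append(sem)
        rgb_frames.append(rgb)
    return semantic_frames, rgb_frames
```

In a continuation setting, an auto-regressive model would be conditioned on the interleaved tokens of the given frames and asked to extend the stream; `split_tokens` then separates the generated stream back into semantic maps and RGB frames for decoding.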

Paper Structure

This paper contains 28 sections, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Auto-regressively generated minute-level video using ARCON. We show a sample video clip from the BDD100K dataset. We auto-regressively generate $45$ frames given the first $3$ frames at $0.6$ Hz. The ego-car moves forward for a short period and changes lanes to the right in preparation for a right turn. After the right turn, it continues to move forward. This example demonstrates that our model can generate reasonable first-person driving videos, including a completely new scene after the turn.
  • Figure 2: The structure of our ARCON model. Left: We use Uniformer [li2023uniformer] to estimate the semantic maps. RGB images and semantic maps are encoded into discrete tokens using the same tokenizer [yu2023language]. Right: We use an auto-regressive model to alternately predict RGB tokens and semantic tokens. During image decoding, the original frame can provide texture guidance for the generated results.
  • Figure 3: Flow-based feature warping in the decoder. During the decoding of generated tokens, some auxiliary features can be transferred from reference tokens using flow-based warping. The input to the right-side decoder is additionally concatenated with a warped feature. We can opt for higher-resolution reference frames to provide larger feature maps, and the feature maps of the generated frames are aligned with them through bilinear resizing.
  • Figure 4: Optical flow vector magnitude decay. We generate 150 15-frame clips on the nuScenes validation set and compute the mean optical flow magnitude between adjacent frames with a pre-trained RAFT model [teed2020raft]. A lower value indicates less motion, i.e., a more stationary video. Results demonstrate that continuing semantic maps helps mitigate degradation in video generation.
  • Figure 5: Consistency between generated semantic maps and RGB images. We use the same Uniformer [li2023uniformer] model used in the pipeline to perform semantic segmentation on the frame sequence generated by our ARCON model. We confirm there is a high degree of correspondence when the two modalities are generated alternately. This indicates that this approach implicitly decomposes the video sequencing task into a semantic sequencing task and a semantic-map-to-RGB translation task. Thanks to the auto-regressive paradigm, this translation task is video-consistent.
  • ...and 4 more figures
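
The motion measurement described in the Figure 4 caption (mean optical flow magnitude between adjacent frames, averaged over many clips) can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' evaluation code: it takes flow fields that have already been estimated (for example by a RAFT model) as plain nested lists of `(dx, dy)` vectors, and the names `mean_flow_magnitude` and `magnitude_per_step` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# One flow field: an H x W grid of (dx, dy) displacement vectors.
# The fields are assumed to be precomputed by an optical flow estimator.
Flow = Sequence[Sequence[Tuple[float, float]]]


def mean_flow_magnitude(flow: Flow) -> float:
    """Average Euclidean length of all flow vectors in one field."""
    magnitudes = [math.hypot(dx, dy) for row in flow for dx, dy in row]
    if not magnitudes:
        raise ValueError("flow field is empty")
    return sum(magnitudes) / len(magnitudes)


def magnitude_per_step(clips: Sequence[Sequence[Flow]]) -> List[float]:
    """Average the per-field mean magnitude over clips, one value per frame pair.

    Each clip is a list of flow fields for its adjacent frame pairs, so a
    15-frame clip contributes 14 fields.
    """
    if not clips:
        raise ValueError("need at least one clip")
    steps = len(clips[0])
    if any(len(clip) != steps for clip in clips):
        raise ValueError("all clips must have the same number of flow fields")
    return [
        sum(mean_flow_magnitude(clip[t]) for clip in clips) / len(clips)
        for t in range(steps)
    ]
```

A curve from `magnitude_per_step` that drops toward zero over time would correspond to generated video that is becoming stationary, which is the degradation the caption refers to.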