InfinityDrive: Breaking Time Limits in Driving World Models

Xi Guo, Chenjing Ding, Haoxuan Dou, Xin Zhang, Weixuan Tang, Wei Wu

TL;DR

InfinityDrive is presented as the first driving world model with exceptional generalization capabilities, delivering state-of-the-art fidelity, consistency, and diversity in minute-scale video generation.

Abstract

Autonomous driving systems struggle with complex scenarios because they have limited access to diverse, extensive, and out-of-distribution driving data, which are critical for safe navigation. World models offer a promising solution to this challenge; however, current driving world models are constrained by short time windows and limited scenario diversity. To bridge this gap, we introduce InfinityDrive, the first driving world model with exceptional generalization capabilities, delivering state-of-the-art performance in fidelity, consistency, and diversity with minute-scale video generation. InfinityDrive introduces an efficient spatio-temporal co-modeling module paired with an extended temporal training strategy, enabling high-resolution (576$\times$1024) video generation with consistent spatial and temporal coherence. By incorporating memory injection and retention mechanisms alongside an adaptive memory curve loss to minimize cumulative errors, it achieves consistent video generation lasting over 1500 frames (more than 2 minutes). Comprehensive experiments on multiple datasets validate InfinityDrive's ability to generate complex and varied scenarios, highlighting its potential as a next-generation driving world model built for the evolving demands of autonomous driving. Our project homepage: https://metadrivescape.github.io/papers_project/InfinityDrive/page.html

Paper Structure

This paper contains 22 sections, 7 equations, 11 figures, 3 tables.

Figures (11)

  • Figure 1: InfinityDrive can generate long-term driving videos up to 1500 frames.
  • Figure 2: InfinityDrive Pipeline: We introduce an efficient spatio-temporal co-modeling module enhanced with memory injection and retention mechanisms. Combined with long-term training strategies and a memory curve adaptive loss, our model achieves high-resolution video generation lasting over 1500 frames.
  • Figure 3: FID and FVD curves of world models as the generated video grows longer. We measure FID and FVD at frames 40, 80, and 120, each time using the 40 frames generated immediately before that point.
  • Figure 4: Comparison of long-term video generation results under identical historical image conditions: a) SVD-XT becomes blurry and oversaturated by 80 frames and eventually fails; b) Vista ultimately loses all details; c) StreamingT2V loses details and displays inconsistencies; d) our model generates up to 1200 frames, preserving spatial detail and maintaining both long- and short-term temporal consistency (highlighted in red boxes).
  • Figure 5: Our long-term generation results on the OpenDV-2K dataset. More results can be found in the Appendix.
  • ...and 6 more figures
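
The windowed measurement described in the Figure 3 caption (scores at frames 40, 80, and 120, each computed over the preceding 40 generated frames) can be sketched as follows. This is only an illustrative reading of the caption, not the authors' evaluation code: the function name `evaluation_windows`, the 0-based half-open indexing, and the default window size are assumptions, and the FID/FVD computation itself is left out.

```python
from typing import List, Tuple


def evaluation_windows(checkpoints: List[int], window: int = 40) -> List[Tuple[int, int]]:
    """Return one half-open (start, end) frame range per checkpoint.

    Each range covers the `window` generated frames that end at the
    checkpoint. Indexing is 0-based and half-open, which is an assumption;
    the caption does not state the indexing convention.
    """
    windows = []
    for end in checkpoints:
        if end < window:
            raise ValueError(f"checkpoint {end} is shorter than the window {window}")
        windows.append((end - window, end))
    return windows


if __name__ == "__main__":
    # Checkpoints taken from the Figure 3 caption.
    for start, end in evaluation_windows([40, 80, 120]):
        # A metric such as FID or FVD would be computed on frames[start:end].
        print(f"score at frame {end}: uses generated frames {start}..{end - 1}")
```

Under this reading the three windows do not overlap, so each score reflects only the most recent 40 frames and a rising curve indicates quality degrading as generation proceeds.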