
UFO: Unifying Feed-Forward and Optimization-based Methods for Large Driving Scene Modeling

Kaiyuan Tan, Yingying Shen, Mingfei Tu, Haohui Zhu, Bing Wang, Guang Chen, Hangjun Ye, Haiyang Sun

TL;DR

UFO is a recurrent paradigm that combines the benefits of optimization-based and feed-forward methods for efficient long-range 4D reconstruction of driving scenes. It also introduces an object pose-guided modeling approach that supports accurate long-range motion capture of dynamic objects.

Abstract

Dynamic driving scene reconstruction is critical for autonomous driving simulation and closed-loop learning. While recent feed-forward methods have shown promise for 3D reconstruction, they struggle with long-range driving sequences because their complexity is quadratic in sequence length and because dynamic objects are hard to model over extended durations. We propose UFO, a novel recurrent paradigm that combines the benefits of optimization-based and feed-forward methods for efficient long-range 4D reconstruction. Our approach maintains a 4D scene representation that is iteratively refined as new observations arrive, using a visibility-based filtering mechanism to select informative scene tokens and enable efficient processing of long sequences. For dynamic objects, we introduce an object pose-guided modeling approach that supports accurate long-range motion capture. Experiments on the Waymo Open Dataset demonstrate that our method significantly outperforms both per-scene optimization and existing feed-forward methods across various sequence lengths. Notably, our approach can reconstruct 16-second driving logs within 0.5 seconds while maintaining superior visual quality and geometric accuracy.
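To make the visibility-based filtering idea concrete, here is a minimal sketch in plain Python. The abstract does not say how visibility is determined, so the sketch assumes each scene token carries a 3D anchor position and counts as visible when that anchor projects inside the current pinhole camera image within a depth range. The function name, the parameters, and the near/far thresholds are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def visible_token_indices(
    positions: Sequence[Vec3],
    rotation: Sequence[Sequence[float]],  # assumed world-to-camera rotation, 3x3 row-major
    translation: Vec3,                    # assumed world-to-camera translation
    fx: float, fy: float, cx: float, cy: float,
    width: int, height: int,
    near: float = 0.1, far: float = 100.0,  # hypothetical depth limits
) -> List[int]:
    """Return indices of tokens whose 3D anchor projects inside the image."""
    keep: List[int] = []
    for i, (x, y, z) in enumerate(positions):
        # Transform the anchor from world coordinates to camera coordinates.
        xc = rotation[0][0] * x + rotation[0][1] * y + rotation[0][2] * z + translation[0]
        yc = rotation[1][0] * x + rotation[1][1] * y + rotation[1][2] * z + translation[1]
        zc = rotation[2][0] * x + rotation[2][1] * y + rotation[2][2] * z + translation[2]
        # Discard anchors behind the camera or outside the depth range.
        if not (near <= zc <= far):
            continue
        # Pinhole projection to pixel coordinates.
        u = fx * xc / zc + cx
        v = fy * yc / zc + cy
        if 0 <= u < width and 0 <= v < height:
            keep.append(i)
    return keep
```

Under these assumptions, only the returned subset of tokens would be passed to the update step for a given frame, so the per-step cost depends on the number of visible tokens rather than on the full scene.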

Paper Structure

This paper contains 33 sections, 12 equations, 9 figures, 2 tables, and 1 algorithm.

Figures (9)

  • Figure 1: (a) Per-scene optimization methods rely on complex update pipelines to iteratively refine scene representations. (b) Feed-forward methods directly predict 3D representations from image pixels. (c) Our UFO integrates the strengths of both paradigms by abstracting the render-supervise-update process into a single holistic transformer, enabling efficient long-range 4D reconstruction in a recurrent manner.
  • Figure 2: Overview of our proposed framework. Given a long sequence of multi-view images, we reconstruct the 4D scene in a recurrent manner. (A) At each time step, we update the scene representation by refining previous scene tokens based on the new observation and adding new information from the current frame. (B) To efficiently handle long sequences, we employ a visibility-based filtering mechanism to select relevant scene tokens for updating. A unified transformer model learns to update the scene in a feed-forward manner. (C) Dynamic objects are modeled using 3D bounding boxes and per-Gaussian lifespans, enabling complex motion modeling over time.
  • Figure 3: Qualitative comparison on novel view synthesis.
  • Figure 4: Inference time and memory usage comparison. We compare inference time and memory usage at different sequence lengths between STORM and our method. Results exclude the Gaussian rendering stage.
  • Figure 5: Visualization of dynamic object modeling. Lifespan and motion assignment enable accurate modeling of dynamic objects. Top: rendered RGB at a novel timestep. Middle: rendered lifespan map, where blue indicates transient objects with short lifespan. Bottom: rendered motion-assignment map; different colors indicate different objects. We leverage object poses to transform Gaussians accordingly.
  • ...and 4 more figures
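
The dynamic-object modeling described in the captions of Figures 2 and 5 (3D bounding boxes, per-Gaussian lifespans, motion assignment) can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: it assumes a hard lifespan interval per Gaussian, a box pose reduced to yaw plus translation, and dynamic Gaussian centers stored in the box-local frame. All class and function names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Gaussian:
    center: Vec3              # box-local if assigned to an object, else world frame
    object_id: Optional[int]  # motion assignment; None means static background
    t_start: float            # assumed hard lifespan interval [t_start, t_end]
    t_end: float


@dataclass
class BoxPose:
    yaw: float                # assumed: rotation about the vertical axis only
    translation: Vec3


def gaussian_centers_at(
    gaussians: List[Gaussian], poses: Dict[int, BoxPose], t: float
) -> List[Vec3]:
    """World-frame centers of the Gaussians that are alive at time t."""
    out: List[Vec3] = []
    for g in gaussians:
        # Lifespan check: Gaussians outside their interval are not rendered.
        if not (g.t_start <= t <= g.t_end):
            continue
        if g.object_id is None:
            out.append(g.center)
            continue
        # Move the box-local center into the world frame using the object pose at t.
        pose = poses[g.object_id]
        c, s = math.cos(pose.yaw), math.sin(pose.yaw)
        x, y, z = g.center
        tx, ty, tz = pose.translation
        out.append((c * x - s * y + tx, s * x + c * y + ty, z + tz))
    return out
```

Here `poses` holds the box pose of each object at the queried time `t`. A Gaussian with a short lifespan, such as the transient objects shown in blue in the lifespan map, simply drops out of the returned list outside its interval.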