UFO: Unifying Feed-Forward and Optimization-based Methods for Large Driving Scene Modeling

Kaiyuan Tan; Yingying Shen; Mingfei Tu; Haohui Zhu; Bing Wang; Guang Chen; Hangjun Ye; Haiyang Sun

TL;DR

UFO is a novel recurrent paradigm that combines the benefits of optimization-based and feed-forward methods for efficient long-range 4D reconstruction of driving scenes. It also introduces an object pose-guided modeling approach that supports accurate long-range motion capture for dynamic objects.

Abstract

Dynamic driving scene reconstruction is critical for autonomous driving simulation and closed-loop learning. While recent feed-forward methods have shown promise for 3D reconstruction, they struggle with long-range driving sequences due to quadratic complexity in sequence length and challenges in modeling dynamic objects over extended durations. We propose UFO, a novel recurrent paradigm that combines the benefits of optimization-based and feed-forward methods for efficient long-range 4D reconstruction. Our approach maintains a 4D scene representation that is iteratively refined as new observations arrive, using a visibility-based filtering mechanism to select informative scene tokens and enable efficient processing of long sequences. For dynamic objects, we introduce an object pose-guided modeling approach that supports accurate long-range motion capture. Experiments on the Waymo Open Dataset demonstrate that our method significantly outperforms both per-scene optimization and existing feed-forward methods across various sequence lengths. Notably, our approach can reconstruct 16-second driving logs within 0.5 seconds while maintaining superior visual quality and geometric accuracy.
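The visibility-based filtering mentioned in the abstract can be pictured as a frustum test: only scene tokens that the current camera can see are passed to the update step. The following is a minimal sketch of that idea, not the paper's implementation. It assumes each scene token has a 3D center in world coordinates and a pinhole camera; the function name `visible_token_indices`, the 3x4 `world_to_cam` layout, and the `near` threshold are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def visible_token_indices(
    centers_world: Sequence[Vec3],
    world_to_cam: Sequence[Sequence[float]],  # assumed 3x4 row-major [R | t]
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    width: int,
    height: int,
    near: float = 0.1,  # assumed near-plane distance
) -> List[int]:
    """Return indices of token centers that project inside the image."""
    keep = []
    for i, (x, y, z) in enumerate(centers_world):
        # Transform the token center from world to camera coordinates.
        xc, yc, zc = (
            row[0] * x + row[1] * y + row[2] * z + row[3] for row in world_to_cam
        )
        if zc <= near:
            continue  # behind the camera or too close to it
        # Pinhole projection to pixel coordinates.
        u = fx * xc / zc + cx
        v = fy * yc / zc + cy
        if 0 <= u < width and 0 <= v < height:
            keep.append(i)
    return keep


# Hypothetical example: identity camera pose, 100x100 image.
identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
centers = [(0.0, 0.0, 5.0), (0.0, 0.0, -5.0), (100.0, 0.0, 5.0)]
selected = visible_token_indices(centers, identity, 100, 100, 50, 50, 100, 100)
```

In this example only the first center is selected: the second lies behind the camera and the third projects outside the image. Restricting each update to the selected subset is what keeps the per-step cost from growing with the full sequence length.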

Paper Structure (33 sections, 12 equations, 9 figures, 2 tables, 1 algorithm)

Introduction
Related Work
Feed-Forward Reconstruction
Per-Scene Reconstruction for Driving Scenes
Learning to Optimize for NVS
Method
Problem Formulation.
Overview.
Scene Representation
Network Inputs
Recurrent Scene Update
Update Formulation.
Visibility-Based Filtering.
Local Coordinate System.
Dynamic Object Modeling
...and 18 more sections

Figures (9)

Figure 1: (a) Per-scene optimization methods rely on complex update pipelines to iteratively refine scene representations. (b) Feed-forward methods directly predict 3D representations from image pixels. (c) Our UFO integrates the strengths of both paradigms by abstracting the render-supervise-update process into a single holistic transformer, enabling efficient long-range 4D reconstruction in a recurrent manner.
Figure 2: Overview of our proposed framework. Given a long sequence of multi-view images, we reconstruct the 4D scene in a recurrent manner. (A) At each time step, we update the scene representation by refining previous scene tokens based on the new observation and adding new information from the current frame. (B) To efficiently handle long sequences, we employ a visibility-based filtering mechanism to select relevant scene tokens for updating. A unified transformer model learns to update the scene in a feed-forward manner. (C) Dynamic objects are modeled using 3D bounding boxes and per-Gaussian lifespans, enabling complex motion modeling over time.
Figure 3: Qualitative comparison on novel view synthesis.
Figure 4: Inference Time and Memory Usage Comparison. We compare inference time and memory usage for different sequence lengths between STORM and our method. Results are reported excluding the Gaussian rendering stage.
Figure 5: Visualization of dynamic object modeling. Lifespan and motion assignment enable accurate modeling of dynamic objects. Top: rendered RGB at a novel timestep. Middle: rendered lifespan map, where blue indicates transient objects with short lifespans. Bottom: rendered motion-assignment map; different colors indicate different objects. We leverage object poses to transform Gaussians accordingly.
...and 4 more figures
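
The captions of Figures 2 and 5 describe dynamic objects as Gaussians assigned to 3D bounding boxes, moved by object poses, and gated by per-Gaussian lifespans. The sketch below illustrates that idea under strong simplifying assumptions and is not the paper's implementation: a lifespan is treated as a hard `(start, end)` interval, poses are 3x4 rigid transforms from an object's local frame to the world frame, and the dictionary keys and function names are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Pose = Sequence[Sequence[float]]  # assumed 3x4 row-major [R | t], object -> world


def apply_pose(point: Vec3, pose: Pose) -> Vec3:
    """Apply a rigid transform to a 3D point."""
    x, y, z = point
    return tuple(r[0] * x + r[1] * y + r[2] * z + r[3] for r in pose)


def gaussian_centers_at_time(
    gaussians: List[dict],
    object_poses: Dict[int, Dict[int, Pose]],  # object_id -> timestep -> pose
    t: int,
) -> List[Vec3]:
    """Return world-frame centers of the Gaussians that are alive at timestep t."""
    centers = []
    for g in gaussians:
        start, end = g["lifespan"]
        if not (start <= t <= end):
            continue  # outside its lifespan, so not rendered at this timestep
        obj: Optional[int] = g["object_id"]
        if obj is None:
            centers.append(g["center"])  # static background, already in world frame
        else:
            # Center is stored in the object's box frame; move it with the box.
            centers.append(apply_pose(g["center"], object_poses[obj][t]))
    return centers


# Hypothetical example: object 7 is rotated 90 degrees about z and shifted along x.
pose_t2 = [[0.0, -1.0, 0.0, 10.0], [1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
scene = [
    {"center": (1.0, 0.0, 0.0), "lifespan": (0, 5), "object_id": 7},     # on a vehicle
    {"center": (0.0, 0.0, 0.0), "lifespan": (0, 5), "object_id": None},  # background
    {"center": (3.0, 3.0, 0.0), "lifespan": (0, 1), "object_id": None},  # transient
]
centers_t2 = gaussian_centers_at_time(scene, {7: {2: pose_t2}}, t=2)
```

At `t=2` the vehicle Gaussian follows its box to `(10.0, 1.0, 0.0)`, the background Gaussian stays in place, and the transient Gaussian is dropped because its lifespan has ended.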
