3D Scene Understanding Through Local Random Access Sequence Modeling

Wanhee Lee, Klemen Kotar, Rahul Mysore Venkatesh, Jared Watrous, Honglin Chen, Khai Loong Aw, Daniel L. K. Yamins

TL;DR

This work tackles 3D scene understanding from a single image by proposing Local Random Access Sequence (LRAS), an autoregressive model that combines local patch quantization with random-access decoding to enable stable object and scene manipulation. By conditioning on and predicting optical flow, LRAS supports high-quality novel view synthesis, 3D object edits, and emergent self-supervised depth estimation, trained on a large video dataset. The approach yields state-of-the-art performance on NVS and 3D editing tasks, while maintaining competitive or superior depth estimation, demonstrating LRAS as a unified, scalable alternative to diffusion-based methods for 3D vision. The results suggest LRAS can serve as a foundation for next-generation 3D vision models with broad editing and depth capabilities, driven by flow-conditioned autoregressive generation.

Abstract

3D scene understanding from single images is a pivotal problem in computer vision with numerous downstream applications in graphics, augmented reality, and robotics. While diffusion-based modeling approaches have shown promise, they often struggle to maintain object and scene consistency, especially in complex real-world scenarios. To address these limitations, we propose an autoregressive generative approach called Local Random Access Sequence (LRAS) modeling, which uses local patch quantization and randomly ordered sequence generation. By utilizing optical flow as an intermediate representation for 3D scene editing, our experiments demonstrate that LRAS achieves state-of-the-art novel view synthesis and 3D object manipulation capabilities. Furthermore, we show that our framework naturally extends to self-supervised depth estimation through a simple modification of the sequence design. By achieving strong performance on multiple 3D scene understanding tasks, LRAS provides a unified and effective framework for building the next generation of 3D vision models.
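The two ingredients named in the abstract, local patch quantization and randomly ordered sequence generation, can be illustrated with a toy sketch. This is not the authors' implementation: it assumes one integer code per patch, uses random latents in place of a trained autoencoder, and the function names (`lfq_quantize`, `serialize`, `deserialize`) are hypothetical. The 16-bit width follows the codebook size given in the Figure 1 caption; the quantizer is lookup-free quantization in its generic form, where the sign of each latent dimension contributes one bit.

```python
import numpy as np

BITS = 16  # codebook width taken from the Figure 1 caption


def lfq_quantize(latents):
    """Lookup-free quantization: the sign of each latent dimension is one bit."""
    bits = (latents > 0).astype(np.int64)            # (num_patches, BITS)
    weights = 1 << np.arange(BITS, dtype=np.int64)   # 1, 2, 4, ..., 2**15
    return bits @ weights                            # integer code in [0, 2**BITS)


def serialize(codes, rng):
    """Interleave (pointer, content) token pairs in a random patch order."""
    order = rng.permutation(len(codes))
    seq = np.empty(2 * len(codes), dtype=np.int64)
    seq[0::2] = order           # pointer tokens: which patch comes next
    seq[1::2] = codes[order]    # content tokens: the code of that patch
    return seq


def deserialize(seq, num_patches):
    """Place each content token back at the location named by its pointer."""
    grid = np.full(num_patches, -1, dtype=np.int64)
    grid[seq[0::2]] = seq[1::2]
    return grid


rng = np.random.default_rng(0)
latents = rng.standard_normal((64, BITS))  # assumed: an 8x8 grid of patch latents
codes = lfq_quantize(latents)
seq = serialize(codes, rng)
restored = deserialize(seq, len(codes))
```

Because every content token is preceded by a pointer naming its patch, the sequence can be written or decoded in any order and still reassembles to the same grid. A real model would likely keep pointer and content tokens in separate vocabulary ranges; the sketch stores both as plain integers for brevity.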

Paper Structure

This paper contains 21 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 1: LRAS Architecture. A. Quantization: We train a small, patch-local, convolutional autoencoder with a 16-bit LFQ codebook. B. Serialization: We serialize the codes into sequences using the pointer-content representation, which allows us to arbitrarily order the patches during training and generation. C. Local Random Access Sequence Modeling: We train an LLM-like autoregressive transformer to predict the contents of the next patch, shuffled in random order. D. Sequence Design With Optical Flow: We design sequences of tokens that contain optical flow intermediates, to provide robust control over the generation. We train two models: $\textbf{LRAS}_\textbf{RGB}$, which is conditioned on a source RGB image and an optical flow describing the desired transformation to predict the next frame, and $\textbf{LRAS}_\textbf{FLOW}$, which is conditioned on a source RGB image to predict a plausible optical flow field.
  • Figure 2: 3D Scene Editing Through Flow Field Manipulation: We perform 3D scene edits by constructing optical flow fields corresponding to the desired transformations - either camera or object motion in 3D.
  • Figure 3: Novel view synthesis from a single image. The results show that our model performs controllable novel view synthesis with various camera motions in diverse scenes. Compared to other models, the reconstructed images do not show abrupt changes in object and scene identity. See supplementary for more results.
  • Figure 4: 3D object manipulation from a single image. We show that our model can perform both 3D object translation and rotation. Compared to the other methods, our model preserves object identity on real-world images, and produces more photorealistic generated images with accurate object edits. See supplementary for more results.
  • Figure 5: Self-supervised monocular depth estimation. On static scenes, our model performs comparably to existing self-supervised depth estimation methods. However, when there are dynamic objects in the scene, our model significantly outperforms geometric-consistency-based methods, demonstrating its robustness in handling moving objects. Yellow artifacts in the ground-truth depth maps are noise and are excluded from the evaluation.
  • ...and 3 more figures
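
The Figure 2 caption describes building an optical flow field that corresponds to a desired 3D camera or object motion. One standard way to construct such a field for camera motion is rigid reprojection, sketched below. This is an illustration under stated assumptions, not the paper's procedure: it assumes a per-pixel depth map and pinhole intrinsics `K` are available, that the motion is given as a rotation `R` and translation `t` mapping source-frame points into the target frame, and the function name `camera_motion_flow` is hypothetical.

```python
import numpy as np


def camera_motion_flow(depth, K, R, t):
    """Flow induced by a rigid camera motion, with X_target = R @ X_source + t."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w, dtype=np.float64),
                       np.arange(h, dtype=np.float64))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # homogeneous pixels, (h, w, 3)
    rays = pix @ np.linalg.inv(K).T                    # back-project to unit-depth rays
    points = rays * depth[..., None]                   # 3D points in the source frame
    moved = points @ R.T + t                           # same points in the target frame
    proj = moved @ K.T                                 # project with the same intrinsics
    uv_new = proj[..., :2] / proj[..., 2:3]            # perspective divide
    return uv_new - pix[..., :2]                       # per-pixel (du, dv)


# Assumed toy setup: 64x64 image, focal length 100 px, flat scene at depth 2.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 32.0],
              [0.0, 0.0, 1.0]])
depth = np.full((64, 64), 2.0)
flow = camera_motion_flow(depth, K, np.eye(3), np.array([0.1, 0.0, 0.0]))
static = camera_motion_flow(depth, K, np.eye(3), np.zeros(3))
```

With a pure sideways translation of 0.1 and constant depth 2, every pixel shifts by `100 * 0.1 / 2 = 5` pixels horizontally, and an identity motion gives zero flow. An object edit could be sketched the same way by applying `R` and `t` only to pixels inside an object mask and leaving the flow at zero elsewhere.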