Unsupervised Discovery of Object Radiance Fields

Hong-Xing Yu, Leonidas J. Guibas, Jiajun Wu

TL;DR

uORF introduces an unsupervised framework that decouples a scene into object radiance fields and a background field, inferred from a single image and rendered via a conditional NeRF. It uses a background-aware slot attention mechanism to discover and bind object-centric latents, followed by a compositional neural renderer and a coarse-to-fine training regime to manage compute and multimodal uncertainties. The method achieves state-of-the-art 3D segmentation from single views, plausible novel-view synthesis, and accessible 3D scene editing, across progressively complex synthetic datasets. Its results demonstrate that integrating neural rendering with deep inference enables robust, unsupervised, 3D-consistent scene decomposition and manipulation. This work suggests a viable path toward fully generative, 3D-aware scene representations without explicit 3D supervision or object-category annotations.
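To make the slot attention idea above concrete, here is a minimal sketch of one simplified attention update in plain Python. It is not the authors' implementation: it omits the learned projections, normalization layers, and recurrent update used in practice, and it does not model the background-aware treatment of a dedicated background slot. The function name `slot_attention_step` and the toy `slots` and `features` values are hypothetical, chosen only for illustration. What it does show is the core binding mechanism: the softmax runs over slots, so slots compete for each image feature, and each slot is then updated to the weighted mean of the features it won.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def slot_attention_step(slots, features, eps=1e-8):
    """One simplified slot-attention update (illustrative assumption).

    slots:    K vectors of dimension D
    features: N vectors of dimension D
    Returns (new_slots, attn), where attn[k][n] is slot k's share of feature n.
    """
    dim = len(slots[0])
    scale = 1.0 / math.sqrt(dim)
    # Softmax over slots for each feature: slots compete for every feature.
    per_feature = [softmax([scale * dot(s, f) for s in slots]) for f in features]
    attn = [[per_feature[n][k] for n in range(len(features))]
            for k in range(len(slots))]
    new_slots = []
    for k in range(len(slots)):
        # Normalize each slot's attention over features, then take the
        # weighted mean of the features as the updated slot.
        total = sum(attn[k]) + eps
        weights = [a / total for a in attn[k]]
        new_slots.append([sum(w * f[d] for w, f in zip(weights, features))
                          for d in range(dim)])
    return new_slots, attn


# Toy input: two clusters of 2-D features and two slots.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
slots = [[1.0, 0.0], [0.0, 1.0]]
new_slots, attn = slot_attention_step(slots, features)
```

In this toy input, the first slot receives the larger share of the first two features and the second slot the larger share of the last two, which is the "binding" behavior in miniature.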

Abstract

We study the problem of inferring an object-centric scene representation from a single image, aiming to derive a representation that explains the image formation process, captures the scene's 3D nature, and is learned without supervision. Most existing methods on scene decomposition lack one or more of these characteristics, due to the fundamental challenge in integrating the complex 3D-to-2D image formation process into powerful inference schemes like deep networks. In this paper, we propose unsupervised discovery of Object Radiance Fields (uORF), integrating recent progress in neural 3D scene representations and rendering with deep inference networks for unsupervised 3D scene decomposition. Trained on multi-view RGB images without annotations, uORF learns to decompose complex scenes with diverse, textured backgrounds from a single image. We show that uORF enables novel tasks, such as scene segmentation and editing in 3D, and it performs well on these tasks and on novel view synthesis on three datasets.

Paper Structure

This paper contains 60 sections, 2 equations, 27 figures, 15 tables, 1 algorithm.

Figures (27)

  • Figure 1: Illustration of unsupervised discovery of Object Radiance Fields. We aim to infer factorized object and background radiance fields from a single view, allowing reconstructing and editing of the scene.
  • Figure 1: Comparison to existing methods.
  • Figure 2: Overview. I. Our model learns to infer a set of latents in a single forward pass. II. Each object/background radiance field consists of a latent and a shared conditional NeRF. III. During training, we recompose the scene and re-render images for supervision. We train our model on different scenes. At test time, we use a single image of an unseen scene for reconstruction or editing.
  • Figure 3: Our object-centric latent inference. The attention binds each object's features to a slot.
  • Figure 4: Examples on scene segmentation in 3D. Novel view images are for reference but not input.
  • ...and 22 more figures
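
The recomposition and re-rendering step described in the Figure 2 caption can be illustrated with a small sketch. It assumes, as one plausible choice that the text above does not spell out, that the per-field outputs at each sample point are merged by a density-weighted mean, and that the merged values are then rendered along a ray with the standard NeRF quadrature. The function names `compose_point` and `render_ray` and all numeric values are hypothetical.

```python
import math


def compose_point(densities, colors, eps=1e-8):
    """Merge K fields at one 3D point by a density-weighted mean (assumption).

    densities: K non-negative floats, one per object/background field
    colors:    K RGB triples, one per field
    Returns (sigma, rgb) for the composed scene at this point.
    """
    total = sum(densities)
    weights = [d / (total + eps) for d in densities]
    sigma = sum(w * d for w, d in zip(weights, densities))
    rgb = [sum(w * c[ch] for w, c in zip(weights, colors)) for ch in range(3)]
    return sigma, rgb


def render_ray(sigmas, colors, deltas):
    """Standard volume-rendering quadrature along one ray.

    sigmas: composed density at each sample, ordered near to far
    colors: composed RGB at each sample
    deltas: distance between adjacent samples
    """
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for sigma, color, delta in zip(sigmas, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha
    return pixel


# Toy ray with three samples and two fields: background (red), object (green).
per_sample_densities = [[0.1, 0.0], [0.1, 5.0], [0.1, 0.0]]
field_colors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
composed = [compose_point(d, field_colors) for d in per_sample_densities]
pixel = render_ray([s for s, _ in composed],
                   [c for _, c in composed],
                   [0.5, 0.5, 0.5])
```

Under this sketch, scene editing amounts to changing what is passed to `compose_point`, for example dropping one field's density and color before rendering to remove that object.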