Unsupervised Discovery of Object-Centric Neural Fields

Rundong Luo, Hong-Xing Yu, Jiajun Wu

TL;DR

The paper tackles unsupervised discovery of 3D object-centric scene representations from a single image. Prior methods encode objects in the viewer's coordinate frame and therefore struggle to generalize. The paper introduces Unsupervised discovery of Object-Centric neural Fields (uOCF), which disentangles object intrinsics from extrinsics and renders with object-centric NeRFs, yielding translation-invariant object representations. The model is trained on sparse multi-view images and needs only a single image at inference. A two-stage training regime learns 3D object priors from simple synthetic scenes and transfers them to more complex real scenes, aided by a suite of losses and an object-centric sampling strategy; the model also supports zero-shot generalization with test-time optimization. Empirically, uOCF outperforms state-of-the-art baselines on multiple tasks, generalizes to unseen configurations and objects, and enables 3D object segmentation and scene manipulation in real kitchen environments. Datasets and code are to be released.

Abstract

We study inferring 3D object-centric scene representations from a single image. While recent methods have shown potential in unsupervised 3D object discovery from simple synthetic images, they fail to generalize to real-world scenes with visually rich and diverse objects. This limitation stems from their object representations, which entangle objects' intrinsic attributes like shape and appearance with extrinsic, viewer-centric properties such as their 3D location. To address this bottleneck, we propose Unsupervised discovery of Object-Centric neural Fields (uOCF). uOCF focuses on learning the intrinsics of objects and models the extrinsics separately. Our approach significantly improves systematic generalization, thus enabling unsupervised learning of high-fidelity object-centric scene representations from sparse real-world images. To evaluate our approach, we collect three new datasets, including two real kitchen environments. Extensive experiments show that uOCF enables unsupervised discovery of visually rich objects from a single real image, allowing applications such as 3D object segmentation and scene manipulation. Notably, uOCF demonstrates zero-shot generalization to unseen objects from a single real image. Project page: https://red-fairy.github.io/uOCF/
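
The split between intrinsics and extrinsics described above can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: `toy_object_density` is a hypothetical stand-in for a learned object-centric field (here a soft sphere), `query_in_world` is a hypothetical helper, and the extrinsic is assumed to be a pure translation. It shows why a field queried in the object's own frame returns the same value when the object and the query point are moved together.

```python
import math

def toy_object_density(local_point):
    """Hypothetical stand-in for a learned object-centric field.

    The density depends only on the distance to the object's own origin.
    """
    x, y, z = local_point
    return math.exp(-(x * x + y * y + z * z) / 0.25)

def query_in_world(world_point, object_position):
    """Query the field at a world-space point, given the object's 3D position."""
    # World frame -> object frame. Only a translation is assumed here.
    local_point = tuple(w - p for w, p in zip(world_point, object_position))
    return toy_object_density(local_point)

def translate(point, offset):
    """Shift a 3D point by an offset."""
    return tuple(a + b for a, b in zip(point, offset))

position = (1.0, 0.0, 2.0)   # extrinsic: where the object sits in the scene
query = (1.1, 0.2, 2.0)      # a world-space sample near the object
shift = (-3.0, 0.5, 1.0)     # move the object and the sample together

density_before = query_in_world(query, position)
density_after = query_in_world(translate(query, shift), translate(position, shift))
is_translation_invariant = math.isclose(density_before, density_after, rel_tol=1e-9)
```

A field that took world coordinates directly would have to relearn the same shape at every location; in this sketch the shape lives in `toy_object_density` and the location lives only in `position`.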

Paper Structure

This paper contains 20 sections, 7 equations, 20 figures, and 8 tables.

Figures (20)

  • Figure 1: We propose Unsupervised discovery of Object-Centric neural Fields (uOCF), which infers factorized 3D scene representations from an unseen real image, thus enabling scene reconstruction and manipulation from novel views. We compare uOCF with the state-of-the-art method, uORF.
  • Figure 2: With a single forward pass, uOCF processes a single image input to infer a set of object-centric radiance fields along with their 3D locations and background radiance field. uOCF is trained on sparse multi-view images from a collection of scenes and uses a single image as input during inference.
  • Figure 3: Our object-centric design allows learning 3D object priors that generalize across different scene configurations. We first train our model to learn 3D object priors on simple synthetic scenes (e.g., single synthetic object), and then we leverage the 3D object priors to learn to discover objects in more complex scenes with different object categories and spatial layouts. Note that no object annotation is needed in either stage.
  • Figure 4: Samples from our collected datasets, where Room-Texture and Room-Furniture consist of synthetic images, and Kitchen-Matte and Kitchen-Shiny consist of real photos.
  • Figure 5: Scene segmentation qualitative results. Novel view images are for reference only.
  • ...and 15 more figures
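
The caption of Figure 2 describes a set of object-centric radiance fields, their 3D locations, and a background radiance field. One way such outputs could be combined at a 3D point is sketched below. This is an illustrative assumption rather than the paper's stated procedure: the density-weighted color mixing rule, the `PlacedObject` type, the `compose_at` helper, and the toy fields are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Point = Tuple[float, float, float]
Color = Tuple[float, float, float]

@dataclass
class PlacedObject:
    """Hypothetical container: a field in object coordinates plus a position."""
    field: Callable[[Point], Tuple[float, Color]]  # local point -> (density, rgb)
    position: Point

def compose_at(world_point, objects, background):
    """Combine the background and all object fields at one world-space point.

    Assumed rule: densities add, colors are averaged weighted by density.
    """
    samples = [background(world_point)]  # the background is queried in world space
    for obj in objects:
        local = tuple(w - p for w, p in zip(world_point, obj.position))
        samples.append(obj.field(local))
    total = sum(density for density, _ in samples)
    if total == 0.0:
        return 0.0, (0.0, 0.0, 0.0)
    color = tuple(sum(d * c[i] for d, c in samples) / total for i in range(3))
    return total, color

def unit_blob(rgb):
    """Toy object field: density 1 inside a unit sphere, 0 outside."""
    def field(local_point):
        inside = sum(v * v for v in local_point) <= 1.0
        return (1.0 if inside else 0.0), rgb
    return field

def empty_background(world_point):
    """Toy background field with no density anywhere."""
    return 0.0, (0.0, 0.0, 0.0)

red = PlacedObject(unit_blob((1.0, 0.0, 0.0)), (0.0, 0.0, 0.0))
blue = PlacedObject(unit_blob((0.0, 0.0, 1.0)), (1.0, 0.0, 0.0))

# A point covered by both blobs, and a point covered by the red blob only.
density_overlap, color_overlap = compose_at((0.5, 0.0, 0.0), [red, blue], empty_background)
density_red_only, color_red_only = compose_at((-0.5, 0.0, 0.0), [red, blue], empty_background)
```

Because each object's location is stored separately from its field in this sketch, moving an object only requires changing its `position`, which is the kind of edit that scene manipulation relies on.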