Unsupervised Discovery of Object-Centric Neural Fields

Rundong Luo; Hong-Xing Yu; Jiajun Wu

TL;DR

The paper tackles unsupervised discovery of 3D object-centric scene representations from a single image. Prior methods encode objects in the viewer's coordinate frame, which entangles object appearance with 3D location and limits generalization. The paper introduces Unsupervised discovery of Object-Centric neural Fields (uOCF), which separates object intrinsics (shape and appearance) from extrinsics (3D location) and renders each object with an object-centric NeRF, yielding translation-invariant object representations. The model is trained on sparse multi-view images and needs only a single image at inference. A two-stage training regime first learns 3D object priors from simple synthetic scenes and then transfers them to more complex real scenes, aided by a suite of losses and an object-centric sampling strategy; the model also supports zero-shot generalization with test-time optimization. Empirically, uOCF outperforms state-of-the-art baselines on multiple tasks, generalizes to unseen configurations and objects, and enables 3D object segmentation and scene manipulation in real kitchen environments. Datasets and code are to be released.

Abstract

We study inferring 3D object-centric scene representations from a single image. While recent methods have shown potential in unsupervised 3D object discovery from simple synthetic images, they fail to generalize to real-world scenes with visually rich and diverse objects. This limitation stems from their object representations, which entangle objects' intrinsic attributes like shape and appearance with extrinsic, viewer-centric properties such as their 3D location. To address this bottleneck, we propose Unsupervised discovery of Object-Centric neural Fields (uOCF). uOCF focuses on learning the intrinsics of objects and models the extrinsics separately. Our approach significantly improves systematic generalization, thus enabling unsupervised learning of high-fidelity object-centric scene representations from sparse real-world images. To evaluate our approach, we collect three new datasets, including two real kitchen environments. Extensive experiments show that uOCF enables unsupervised discovery of visually rich objects from a single real image, allowing applications such as 3D object segmentation and scene manipulation. Notably, uOCF demonstrates zero-shot generalization to unseen objects from a single real image. Project page: https://red-fairy.github.io/uOCF/
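The separation of intrinsics from extrinsics described above can be illustrated with a minimal sketch. It is not the authors' implementation: `object_field` is a hypothetical toy density standing in for a learned object NeRF, only translation is modeled, and summing densities is a simplifying assumption about how objects are composed. The point it shows is that when a field is queried in coordinates relative to the object's inferred position, moving the object leaves the field's output unchanged.

```python
import math

def object_field(local_xyz):
    """Toy stand-in for an object NeRF (hypothetical, not the paper's network):
    density of a soft sphere of radius 0.5 centred at the object-frame origin."""
    x, y, z = local_xyz
    r = math.sqrt(x * x + y * y + z * z)
    return math.exp(-10.0 * max(r - 0.5, 0.0))

def to_object_frame(world_xyz, object_position):
    """Express a world-space point relative to the object's inferred position."""
    return tuple(w - p for w, p in zip(world_xyz, object_position))

def query_scene(world_xyz, object_positions):
    """Sum the densities that each object contributes at one world-space point
    (a simplifying assumption about composition)."""
    return sum(object_field(to_object_frame(world_xyz, p)) for p in object_positions)

# One object and one query point inside it.
point = (0.25, 0.0, 0.5)
position = (0.0, 0.0, 0.5)

# Move the object and the query point by the same offset.
shift = (2.0, -1.0, 0.0)
moved_point = tuple(a + b for a, b in zip(point, shift))
moved_position = tuple(a + b for a, b in zip(position, shift))

density_before = query_scene(point, [position])
density_after = query_scene(moved_point, [moved_position])
```

Here `density_before` and `density_after` agree because the field only ever sees object-frame coordinates. A viewer-centric field would instead receive different inputs for the two placements and would have to learn the same object separately at each location.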

Paper Structure (20 sections, 7 equations, 20 figures, 8 tables)

Introduction
Related Works
Approach
Model Overview
Object-Centric 3D Scene Modeling
Object Prior Learning
Model Training
Experiments
Baseline Comparison on Multiple Tasks
Generalization Analysis
Ablation Study
Conclusion
Appendix Overview
Proof of Concept
Implementation
...and 5 more sections

Figures (20)

Figure 1: We propose the unsupervised discovery of Object-Centric neural Fields (uOCF), which infers factorized 3D scene representations from an unseen real image, thus enabling scene reconstruction and manipulation from novel views. We compare uOCF with the state-of-the-art method, uORF.
Figure 2: With a single forward pass, uOCF processes a single image input to infer a set of object-centric radiance fields along with their 3D locations and background radiance field. uOCF is trained on sparse multi-view images from a collection of scenes and uses a single image as input during inference.
Figure 3: Our object-centric design allows learning 3D object priors that generalize across different scene configurations. We first train our model to learn 3D object priors on simple synthetic scenes (e.g., single synthetic object), and then we leverage the 3D object priors to learn to discover objects in more complex scenes with different object categories and spatial layouts. Note that no object annotation is needed in either stage.
Figure 4: Samples from our collected datasets, where Room-Texture and Room-Furniture consist of synthetic images, and Kitchen-Matte and Kitchen-Shiny consist of real photos.
Figure 5: Scene segmentation qualitative results. Novel view images are for reference only.
...and 15 more figures
