SlotLifter: Slot-guided Feature Lifting for Learning Object-centric Radiance Fields

Yu Liu, Baoxiong Jia, Yixin Chen, Siyuan Huang

TL;DR

SlotLifter presents a novel object-centric radiance-field model that lifts 2D multi-view features into 3D point features and couples them with learnable object slots via cross-attention for joint scene reconstruction and decomposition. By using a slot-guided feature lifting pipeline and a single reconstruction loss $\mathcal{L}_{\text{recon}}$, the method achieves state-of-the-art results on eight diverse datasets while markedly reducing training time compared to previous 3D object-centric approaches. The approach demonstrates strong gains in both scene decomposition (ARI improvements) and novel-view synthesis (PSNR/LPIPS/SSIM) on synthetic and real-world data, including challenging real-world datasets like ScanNet and DTU. Ablation studies confirm the effectiveness of scene encoding, random masking, and slot-based density, and reveal the method’s sensitivity to the number of slots and source views, suggesting directions for further improvement with semantic priors or geometry priors. Overall, SlotLifter narrows the gap to real-world scene understanding with efficient, scalable object-centric 3D learning and rendering capabilities.

Abstract

The ability to distill object-centric abstractions from intricate visual scenes underpins human-level generalization. Despite the significant progress in object-centric learning methods, learning object-centric representations in the 3D physical world remains a crucial challenge. In this work, we propose SlotLifter, a novel object-centric radiance model addressing scene reconstruction and decomposition jointly via slot-guided feature lifting. Such a design unites object-centric learning representations and image-based rendering methods, offering state-of-the-art performance in scene decomposition and novel-view synthesis on four challenging synthetic and four complex real-world datasets, outperforming existing 3D object-centric learning methods by a large margin. Through extensive ablative studies, we showcase the efficacy of designs in SlotLifter, revealing key insights for potential future directions.

Paper Structure (52 sections, 10 equations, 18 figures, 14 tables)

Figures (18)

  • Figure 1: SlotLifter overview. SlotLifter extracts slots from input view(s) during slot encoding. It then lifts 2D feature maps of input view(s) to initialize 3D point features, which serve as queries in the allocation transformer for point-slot joint decoding. This process yields the point-slot mapping ${\bm{W}}_p$, density $\sigma$, and the slot-aggregated point feature ${\bm{F}}_s$ via an attention layer. Finally, SlotLifter uses these results for rendering novel-view images and segmentation masks via volume rendering.
  • Figure 2: Quantitative comparison for scene decomposition and novel view synthesis on Room-Texture.
  • Figure 2: Qualitative comparison on synthetic scenes. Compared to BO-uORF, SlotLifter renders novel-view images and segmentation masks in much higher quality, especially in detailed object attributes like color and shape (best viewed with zoom-in for the highlighted details).
  • Figure 3: Qualitative comparison on Room-Texture, Kitchen-Shiny, and Kitchen-Matte. Compared to the SOTA method uOCF, SlotLifter renders novel-view images and segmentation masks in higher quality, offering more complete segmentation and more detailed textures (best viewed with zoom-in for the highlighted details).
  • Figure 4: Qualitative results on ScanNet. Our SlotLifter achieves the best performance for novel-view rendering, even surpassing the recent state-of-the-art model GNT, while BO-uORF and OSRT struggle to render novel-view images on ScanNet.
  • ...and 13 more figures
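
The Figure 1 caption describes 3D point features acting as queries over slots to produce a point-slot mapping ${\bm{W}}_p$, which is then combined with the density $\sigma$ in volume rendering to obtain images and segmentation masks. The following is a minimal pure-Python sketch of that idea for a single ray, not the authors' implementation. It assumes plain scaled dot-product attention with a softmax over slots and standard NeRF-style alpha compositing; the function names (`point_slot_attention`, `render_ray`) and all inputs are hypothetical, and densities and colors are taken as given rather than decoded from ${\bm{F}}_s$.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def point_slot_attention(point_queries, slot_keys):
    """Assumed form of the point-slot mapping W_p: for each point query,
    scaled dot-product scores against every slot key, normalised over slots."""
    scale = 1.0 / math.sqrt(len(slot_keys[0]))
    mapping = []
    for q in point_queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in slot_keys]
        mapping.append(softmax(scores))
    return mapping


def render_ray(sigmas, deltas, colors, w_p):
    """Alpha-composite samples along one ray (standard volume rendering).

    sigmas: density per sample; deltas: distance between adjacent samples;
    colors: RGB per sample; w_p: per-sample distribution over slots.
    Returns (pixel RGB, per-slot mask values, accumulated opacity)."""
    transmittance = 1.0
    pixel = [0.0, 0.0, 0.0]
    slot_mask = [0.0] * len(w_p[0])
    for sigma, delta, color, w in zip(sigmas, deltas, colors, w_p):
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha  # contribution of this sample to the pixel
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        slot_mask = [m + weight * wk for m, wk in zip(slot_mask, w)]
        transmittance *= 1.0 - alpha
    return pixel, slot_mask, 1.0 - transmittance


if __name__ == "__main__":
    # Two samples on a ray, three slots, 2-D features (toy values).
    w_p = point_slot_attention([[1.0, 0.0], [0.0, 1.0]],
                               [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
    pixel, mask, opacity = render_ray([0.5, 2.0], [1.0, 1.0],
                                      [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], w_p)
    print(pixel, mask, opacity)
```

Because each row of the mapping sums to one, the per-slot mask values in this sketch add up to the ray's accumulated opacity, so taking the arg-max over slots per pixel gives a segmentation that is consistent with the rendered image.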