Object Learning and Robust 3D Reconstruction

Sara Sabour

TL;DR

This thesis investigates unsupervised object representations for robust image understanding and 3D reconstruction. It introduces FlowCapsules for motion-driven 2D part learning, RobustNeRF for masking transient distractors during NeRF training via robust losses, and SpotLessSplats for robust 3D Gaussian Splatting using semantic feature cues. Across these contributions, the work demonstrates that unsupervised, object-centric representations can achieve strong segmentation, shape completion, and distortion-free 3D reconstructions in casual capture scenarios, often surpassing supervised baselines. The findings highlight practical pathways to scalable, flexible 3D scene modeling in the wild, with implications for real-world AR/VR, robotics, and content creation.

Abstract

In this thesis we discuss architectural designs and training methods that enable a neural network to dissect an image into objects of interest without supervision. The main challenge in 2D unsupervised object segmentation is distinguishing foreground objects of interest from the background. FlowCapsules uses motion as a cue for the objects of interest in 2D scenarios. The last part of this thesis focuses on 3D applications where the goal is detecting and removing the object of interest from the input images. In these tasks, we leverage the geometric consistency of scenes in 3D to detect the inconsistent dynamic objects. Our transient object masks are then used to design robust optimization kernels that improve 3D modelling in a casual capture setup. One of our goals in this thesis is to show the merits of unsupervised object-based approaches in computer vision. Furthermore, we suggest possible directions for defining objects of interest or foreground objects without requiring supervision. Our hope is to motivate and excite the community to further explore explicit object representations in image understanding tasks.

Paper Structure

This paper contains 105 sections, 30 equations, 49 figures, 4 tables.

Figures (49)

  • Figure 1: Examples of human visual perception properties based on gestalt psychology.
  • Figure 2: Deep Learning has different principles for image understanding than humans.
  • Figure 3: Self-supervised training for learning primary capsules: An image encoder is trained to decompose the scene into a collection of primary capsules. Learning is accomplished in an unsupervised manner, using flow estimation from capsule shapes and poses as a proxy task.
  • Figure 4: Inference architecture. (left) The encoder $\mathcal{E}_\omega$ parses an image into part capsules, each comprising a shape vector $s_k$, a pose $\theta_k$, and a scalar depth value $d_k$. (right) The shape decoder $\mathcal{D}_\omega$ is an implicit function. It takes as input a shape vector, $s_k$, and a location in canonical coordinates and returns the probability that the location is inside the shape. Shapes are mapped to image coordinates, using $\theta_k$, and layered according to the relative depths $d_k$, yielding visibility masks.
  • Figure 5: Encoder architecture. The encoder comprises convolution layers with ReLU activation, followed by down-sampling via $2\! \times \!2$ AveragePooling. Following the last convolution layer is a tanh fully connected layer, and a fully connected layer grouped into $K$, $C$-dimensional capsules.
  • ...and 44 more figures
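
The Figure 4 caption says that part shapes are layered according to their relative depths $d_k$ to yield visibility masks. The caption does not give the exact formula, so the following is only a minimal sketch of one common way to do such layering: standard front-to-back compositing in NumPy. The function name `visibility_masks`, the array layout, and the convention that a smaller depth value means closer to the camera are all assumptions made for illustration, not the thesis's implementation.

```python
import numpy as np


def visibility_masks(occupancy, depth):
    """Layer K part masks front to back (hypothetical helper, not from the thesis).

    occupancy: array of shape (K, H, W), per-part inside-shape probability in [0, 1].
    depth:     array of shape (K,), one scalar per part.
               Assumption: a smaller value means the part is closer to the camera.
    Returns an array of shape (K, H, W) with the visible fraction of each part.
    """
    occupancy = np.asarray(occupancy, dtype=float)
    depth = np.asarray(depth, dtype=float)

    visible = np.zeros_like(occupancy)
    # Fraction of each pixel that no nearer part has covered yet.
    free = np.ones(occupancy.shape[1:])

    # Visit parts from nearest to farthest.
    for k in np.argsort(depth):
        visible[k] = occupancy[k] * free
        free = free * (1.0 - occupancy[k])
    return visible


# Two parts on a 1x2 image; part 1 is nearer and covers only the left pixel.
occ = np.array([[[1.0, 1.0]],
                [[1.0, 0.0]]])
vis = visibility_masks(occ, depth=[2.0, 1.0])
```

In this toy case the nearer part keeps the left pixel and the farther part is visible only on the right pixel. A learned model would more likely use a smooth, differentiable variant of this ordering so that gradients can reach the depth values.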