Learning 3D Robotics Perception using Inductive Priors

Muhammad Zubair Irshad

TL;DR

The thesis addresses data and annotation scarcity in robotic 3D perception by leveraging inductive priors that encode geometry, modularity, and contextual structure to enable sim2real and sim2sim generalization. It presents three integrated thrusts: Efficient Object-Centric Neural 3D Representations (CenterSnap, ShAPO), Hierarchical Vision-and-Language for Action (Robo-VLN, SASRA), and Generalizable Self-Supervised 3D Scene Understanding (NeO 360, NeRF-MAE). Empirically, CenterSnap achieves real-time multi-object 3D reconstruction and 6D pose/size estimation; ShAPO delivers superior 3D reconstruction with joint shape/appearance priors and octree optimization; Robo-VLN and SASRA show significant gains in continuous VLN tasks by fusing semantic maps and hierarchical reasoning; NeO 360 and NeRF-MAE demonstrate strong zero-shot and few-shot generalization for outdoor 360° scenes and NeRF pretraining, respectively. Collectively, the work advances data-efficient, priors-driven 3D perception for robotics, enabling robust operation in novel, cluttered environments with limited real-world data.

Abstract

Recent advances in deep learning have led to data-centric intelligence, i.e., artificially intelligent models that ingest large amounts of data and excel at digital tasks such as text-to-image generation, machine-human conversation, and image recognition. This thesis covers the topic of learning with structured inductive bias and priors to design approaches and algorithms that unlock the potential of principle-centric intelligence. Prior knowledge (priors for short), often available in terms of past experience as well as assumptions about how the world works, helps an autonomous agent generalize better and adapt its behavior. In this thesis, I demonstrate the use of prior knowledge in three different robotics perception problems: (1) object-centric 3D reconstruction, (2) vision and language for decision-making, and (3) 3D scene understanding. To solve these challenging problems, I propose various sources of prior knowledge, including (1) geometry and appearance priors from synthetic data, (2) modularity and semantic map priors, and (3) semantic, structural, and contextual priors. I study these priors for solving robotics 3D perception tasks and propose ways to efficiently encode them in deep learning models. Some priors are used to warm-start the network for transfer learning; others are used as hard constraints to restrict the action space of robotics agents. While classical techniques are brittle and fail to generalize to unseen scenarios, and data-centric approaches require a large amount of labeled data, this thesis aims to build intelligent agents that require very little real-world data, or data acquired only from simulation, to generalize to highly dynamic and cluttered environments in novel simulations (i.e., sim2sim) or unseen real-world environments (i.e., sim2real) for a holistic scene understanding of the 3D world.

Paper Structure

This paper contains 112 sections, 33 equations, 54 figures, 28 tables, and 1 algorithm.

Figures (54)

  • Figure 1: Overview: (1) Multi-stage pipelines in comparison to (2) our single-stage approach. The single-stage approach uses object instances as centers to jointly optimize 3D shape, 6D pose, and size.
  • Figure 2: CenterSnap Method: Given a single-view RGB-D observation, our proposed approach jointly optimizes the shape, pose, and size of each object in a single-shot manner. Our method comprises a joint backbone for feature extraction (Section \ref{backbone}), a pointcloud auto-encoder to extract shape codes from a large collection of CAD models (Section \ref{shapecode}), the CenterSnap model, which consists of multiple specialized heads for heatmap and object-centric 3D parameter map prediction (Section \ref{centermodel}), and joint optimization of shape, pose, and size for each object's spatial center (Section \ref{optimize}).
  • Figure 3: Shape Auto-Encoder: We design a Point Auto-encoder (a) to find a unique shape code ($z_{i}$) for each shape. Unit-canonicalized pointcloud outputs from the decoder network are shown in (b). t-SNE embeddings of the shape codes ($z_{i}$) are visualized in (c).
  • Figure 4: Sim2Real Reconstruction: Single-shot sim2real shape reconstructions on NOCS showing pointclouds, meshes and textures.
  • Figure 5: Shape Completion: Chamfer distance (CD reported on y-axis) evaluation on Multi-object ShapeNet dataset.
  • ...and 49 more figures
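
Figure 5 reports shape-completion quality as Chamfer distance (CD). As a reminder of what that metric measures, the following is a minimal pure-Python sketch, not the thesis's evaluation code. It assumes one common convention: the symmetric sum of mean squared nearest-neighbor distances between two point clouds. The names `chamfer_distance`, `_sq_dist`, and `_mean_nearest_sq_dist` are hypothetical, and the exact variant used in the thesis (squared versus unsquared distances, sum versus average of the two directions) may differ.

```python
from typing import Sequence

Point = Sequence[float]


def _sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def _mean_nearest_sq_dist(src: Sequence[Point], dst: Sequence[Point]) -> float:
    """Average, over points in src, of the squared distance to the nearest point in dst."""
    return sum(min(_sq_dist(p, q) for q in dst) for p in src) / len(src)


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance (assumed convention: sum of both directed terms)."""
    if not a or not b:
        raise ValueError("both point clouds must be non-empty")
    return _mean_nearest_sq_dist(a, b) + _mean_nearest_sq_dist(b, a)


if __name__ == "__main__":
    predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    reference = [(0.0, 0.0, 0.0)]
    # Lower values mean the predicted shape lies closer to the reference shape.
    print(chamfer_distance(predicted, reference))
```

This brute-force version costs O(|A| · |B|) distance evaluations, which is fine for illustration; evaluations on dense point clouds would typically use a vectorized or k-d tree nearest-neighbor search instead.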