Generative Modeling Perspective for Control and Reasoning in Robotics

Takuma Yoneda

TL;DR

This work advocates a generative modeling lens for robotics to address multimodal decision problems using diffusion models and GANs. It delivers three contributions: (i) a diffusion-based approach to learn a multimodal distribution of block poses conditioned on silhouettes for stable block stacking; (ii) a diffusion-guided shared autonomy framework that balances user intent and demonstrated behavior via partial forward diffusion controlled by a forward diffusion ratio $\gamma$; and (iii) an unsupervised domain adaptation method that aligns latent representations across source and target domains for visuomotor control without rewards. The findings show that diffusion models can generate diverse, stable, and silhouette-consistent block configurations, that partial diffusion can effectively trade fidelity for conformity in human-robot collaboration, and that latent alignment can substantially improve cross-domain generalization and sim-to-real transfer. Together, these results highlight a practical pathway for robust, multimodal, and transferable robotic control in unstructured environments, with potential impact on collaborative manipulation and autonomous decision-making.
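
To make the role of the forward diffusion ratio $\gamma$ concrete, the following is a minimal sketch of the partial forward diffusion step only, not the thesis's implementation. It assumes a standard DDPM-style closed-form noising rule, a linear noise schedule, and the mapping $k = \mathrm{round}(\gamma T)$ from ratio to diffusion step. None of these details is stated in the summary above, and the names `linear_beta_schedule`, `alpha_bar`, and `partial_forward_diffuse` are hypothetical. The learned reverse (denoising) process that would follow is omitted.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear noise schedule (an assumption, not from the source)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bar(betas):
    """Cumulative products of (1 - beta_t), one entry per diffusion step."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def partial_forward_diffuse(action, gamma, alpha_bars, rng):
    """Noise `action` up to step k = round(gamma * T) in closed form.

    Returns (noised_action, k). With gamma = 0 the action is returned
    unchanged; with gamma = 1 it is noised through all T steps.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    k = round(gamma * len(alpha_bars))
    if k == 0:
        return list(action), 0
    ab = alpha_bars[k - 1]
    signal, noise = math.sqrt(ab), math.sqrt(1.0 - ab)
    return [signal * a + noise * rng.gauss(0.0, 1.0) for a in action], k


# Usage: noise a 2-D user action 40% of the way through a 100-step schedule.
rng = random.Random(0)
abar = alpha_bar(linear_beta_schedule(100))
user_action = [0.5, -0.2]
noised_action, k = partial_forward_diffuse(user_action, 0.4, abar, rng)
```

Under these assumptions, a small $\gamma$ leaves most of the user's action intact (fidelity), while a large $\gamma$ replaces more of it with noise, so that a subsequent reverse process trained on demonstrations has more freedom to pull the result toward demonstrated behavior (conformity).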

Abstract

Heralded by the initial success in speech recognition and image classification, learning-based approaches with neural networks, commonly referred to as deep learning, have spread across various fields. A primitive form of a neural network functions as a deterministic mapping from one vector to another, parameterized by trainable weights. This is well suited for point estimation in which the model learns a one-to-one mapping (e.g., mapping a front camera view to a steering angle) that is required to solve the task of interest. Although learning such a deterministic, one-to-one mapping is effective, there are scenarios where modeling multimodal data distributions, namely learning one-to-many relationships, is helpful or even necessary. In this thesis, we adopt a generative modeling perspective on robotics problems. Generative models learn and produce samples from multimodal distributions, rather than performing point estimation. We will explore the advantages this perspective offers for three topics in robotics.

Paper Structure

This paper contains 57 sections, 36 equations, 20 figures, 6 tables, 4 algorithms.

Figures (20)

  • Figure 1: The overall pipeline of our approach. We design a diffusion model that takes a silhouette of a structure and the available block shapes, and generates a set of block poses $\hat{\bm{p}}_{1}, \ldots, \hat{\bm{p}}_k$ that make up a stable stack matching the silhouette. We further demonstrate that our approach can be applied to a real-world block stacking task.
  • Figure 2: Our strategy to generate a diverse set of stable stacks. After filling the design grid with shapes, we add a small, randomly sampled horizontal displacement at each layer, spawn the blocks in simulation, and verify the stack's stability. From there, we try removing each block to create variations of the stack.
  • Figure 3: Our architecture for diffusion models.
  • Figure 4: Silhouettes from the held-out dataset and renderings of block poses generated by our model.
  • Figure 5: A reference stack (ground truth) with its silhouette (left), and a diverse set of structures generated from the silhouette by our model (right).
  • ...and 15 more figures