Generative Modeling Perspective for Control and Reasoning in Robotics
Takuma Yoneda
TL;DR
This work advocates a generative modeling lens for robotics, using diffusion models and GANs to address multimodal decision problems. It makes three contributions: (i) a diffusion-based approach that learns a multimodal distribution of block poses conditioned on silhouettes for stable block stacking; (ii) a diffusion-guided shared autonomy framework that balances user intent against demonstrated behavior via partial forward diffusion, controlled by a forward diffusion ratio $\gamma$; and (iii) an unsupervised domain adaptation method that aligns latent representations across source and target domains for visuomotor control without rewards. The findings show that diffusion models can generate diverse, stable, and silhouette-consistent block configurations; that partial diffusion can trade fidelity to the user's input against conformity to demonstrated behavior in human-robot collaboration; and that latent alignment can substantially improve cross-domain generalization and sim-to-real transfer. Together, these results point to a practical pathway toward robust, multimodal, and transferable robotic control in unstructured environments, with potential impact on collaborative manipulation and autonomous decision-making.
Abstract
Following early successes in speech recognition and image classification, learning-based approaches with neural networks, commonly referred to as deep learning, have spread across various fields. In its basic form, a neural network functions as a deterministic mapping from one vector to another, parameterized by trainable weights. This is well suited to point estimation, in which the model learns the one-to-one mapping (e.g., from a front camera view to a steering angle) required to solve the task of interest. Although learning such a deterministic, one-to-one mapping is effective, there are scenarios where modeling \emph{multimodal} data distributions, namely learning one-to-many relationships, is helpful or even necessary. In this thesis, we adopt a generative modeling perspective on robotics problems. Generative models learn and produce samples from multimodal distributions, rather than performing point estimation. We explore the advantages this perspective offers for three topics in robotics.
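To make the contrast between point estimation and sampling concrete, the following is a minimal toy sketch and not a method from this thesis. It assumes a hypothetical setting in which, for one and the same observation, demonstrated steering actions cluster around two modes (-1 and +1). The point estimate that minimizes squared error is the sample mean, which falls between the modes, whereas samples drawn from the data distribution stay close to one of them. All names (`actions`, `point_estimate`, `distance_to_nearest_mode`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: for the same observation, demonstrations steer
# either left (-1) or right (+1), with a small amount of noise.
modes = rng.choice([-1.0, 1.0], size=10_000)
actions = modes + 0.05 * rng.standard_normal(10_000)

# Point estimation: the constant minimizing mean squared error is the mean.
point_estimate = actions.mean()

# A trivial stand-in for a generative model: resample the empirical data.
samples = rng.choice(actions, size=5)


def distance_to_nearest_mode(a):
    """Distance from action(s) `a` to the closer of the two modes."""
    return np.minimum(np.abs(a - 1.0), np.abs(a + 1.0))


print("point estimate:", point_estimate)
print("distance of point estimate to nearest mode:",
      distance_to_nearest_mode(point_estimate))
print("distances of samples to nearest mode:",
      distance_to_nearest_mode(samples))
```

In this sketch the averaged action corresponds to steering straight ahead, a behavior that appears in none of the demonstrations, which is the failure case that motivates learning one-to-many relationships.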
