Deep Generative Models in Robotics: A Survey on Learning from Multimodal Demonstrations

Julen Urain, Ajay Mandlekar, Yilun Du, Mahi Shafiullah, Danfei Xu, Katerina Fragkiadaki, Georgia Chalvatzaki, Jan Peters

TL;DR

The paper formalizes learning robot policies from multimodal demonstrations as optimizing a conditional density model $\rho_\theta(a|c)$ to approximate the true distribution $\rho_D(a|c)$. It provides a unified taxonomy of five density-estimation families (Sampling Models, Energy-Based Models, Diffusion Models, Categorical Models, and Mixture Density Models) and maps each to robotics tasks such as grasping, trajectory generation, and scene arrangement. It then discusses generalization strategies, including modular composition, informative feature extraction, and perception-action symmetry, and outlines key challenges in offline multimodal learning for robotics. The survey highlights practical implications for selecting model families, choosing training and sampling strategies, and integrating generative models with perception and planning to enable robust long-horizon robotic behavior. It also points to future directions such as learning from video and synthetic data, online interaction, and leveraging 3D feature fields and foundation models for improved grounding and generalization.
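
As a hedged sketch of the objective described above, assume a generic divergence $\mathbb{D}$ between distributions (the symbol $\mathbb{D}$ and the expectation notation are illustrative choices, not taken from this summary). The policy parameters can then be written as

$$
\theta^{*} = \arg\min_{\theta} \; \mathbb{D}\big(\rho_D(a|c),\, \rho_\theta(a|c)\big).
$$

If $\mathbb{D}$ is taken to be the forward KL divergence, the term that depends only on $\rho_D$ drops out and the problem reduces to maximum likelihood on the demonstrations:

$$
\theta^{*} = \arg\max_{\theta} \; \mathbb{E}_{(a,c)\sim\rho_D}\big[\log \rho_\theta(a|c)\big].
$$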

Abstract

Learning from Demonstrations, the field that proposes to learn robot behavior models from data, is gaining popularity with the emergence of deep generative models. Although the problem has been studied for years under names such as Imitation Learning, Behavioral Cloning, or Inverse Reinforcement Learning, classical methods have relied on models that don't capture complex data distributions well or don't scale well to large numbers of demonstrations. In recent years, the robot learning community has shown increasing interest in using deep generative models to capture the complexity of large datasets. In this survey, we aim to provide a unified and comprehensive review of the last years' progress in the use of deep generative models in robotics. We present the different types of models that the community has explored, such as energy-based models, diffusion models, action value maps, or generative adversarial networks. We also present the different types of applications in which deep generative models have been used, from grasp generation to trajectory generation or cost learning. One of the most important aspects of generative models is out-of-distribution generalization. In our survey, we review the different decisions the community has made to improve the generalization of the learned models. Finally, we highlight the research challenges and propose a number of future directions for learning deep generative models in robotics.

Paper Structure (29 sections, 21 equations, 11 figures)

Figures (11)

  • Figure 1: Structure of the survey with references to the sections.
  • Figure 2: Selected publications for this survey per year. Different colors indicate different types of generative models. We categorize the models into five classes.
  • Figure 3: Left: A visual representation of Sampling Models. Given a latent sample ${\bm{z}}$, usually sampled from a normal distribution, Sampling Models generate an action sample through a learned decoder ${\bm{a}} = {\bm{D}}_{{\boldsymbol{\theta}}}({\bm{z}},{\bm{c}})$. Right: A representation of common applications for Sampling Models: as a sampling distribution [ichter2018learning], as a behavior prior [singh2020parrot], and as a generative model [mousavian20196].
  • Figure 4: Left: A visual representation of an Energy-Based Model (EBM). Given an action variable ${\bm{a}}$ as input, the EBM outputs the unnormalized log probability of the input action, $e = {\bm{E}}_{{\boldsymbol{\theta}}}({\bm{a}},{\bm{c}})$. Right: A visual representation of the different strategies to train or represent an EBM: Contrastive Divergence [finn2016deep], Supervised Learning [weng2023neural], and Neural Descriptor Fields [simeonov2022neural].
  • Figure 5: Left: A visual representation of a Diffusion Model. Given an action ${\bm{a}}$ and a scalar $k$ indicating the diffusion step, the model ${\bm{s}}={\bm{S}}_{{\boldsymbol{\theta}}}({\bm{a}},{\bm{c}},k)$ outputs a vector ${\bm{s}}$ conditioned on ${\bm{c}}$. The output ${\bm{s}}$ is related to the score of a distribution $\rho({\bm{a}}_k)$. Right: A visualization of the denoising process [janner2022planning].
  • ...and 6 more figures
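
The three model interfaces described in the captions for Figures 3 to 5 can be illustrated with a minimal one-dimensional sketch. Everything in it is an assumption made for illustration: the functions `decoder`, `energy`, and `score` are hand-written toy stand-ins for the learned networks $D_\theta$, $E_\theta$, and $S_\theta$, the noise schedule `SIGMAS` is hypothetical, and the sampling loop is a plain annealed Langevin-style update rather than the procedure of any specific paper in the survey.

```python
import math
import random

# Hypothetical noise levels indexed by diffusion step k (k = 0 means no noise).
SIGMAS = [0.0, 0.5, 1.0]


def decoder(z, c):
    """Toy stand-in for a learned decoder D_theta(z, c): maps a latent to an action."""
    return c + z


def energy(a, c):
    """Toy stand-in for E_theta(a, c): unnormalized log-probability of N(c, 1)."""
    return -0.5 * (a - c) ** 2


def score(a, c, k):
    """Toy stand-in for S_theta(a, c, k): exact score of N(c, 1) blurred by SIGMAS[k]."""
    return -(a - c) / (1.0 + SIGMAS[k] ** 2)


def sample_with_decoder(c, rng):
    """Sampling model: draw z from a standard normal and decode it in one pass."""
    z = rng.gauss(0.0, 1.0)
    return decoder(z, c)


def sample_with_score(c, rng, steps_per_level=30, step_size=0.1):
    """Score-based sampling: Langevin updates from the noisiest level down to k = 0."""
    a = rng.gauss(0.0, 2.0)  # arbitrary broad initialization
    for k in reversed(range(len(SIGMAS))):
        for _ in range(steps_per_level):
            noise = rng.gauss(0.0, 1.0)
            a += step_size * score(a, c, k) + math.sqrt(2.0 * step_size) * noise
    return a


if __name__ == "__main__":
    rng = random.Random(0)
    print("decoder sample:", sample_with_decoder(2.0, rng))
    print("score-based sample:", sample_with_score(2.0, rng))
```

In this toy setting the score at step `k = 0` is the derivative of the energy with respect to the action, which is one way to see how the two representations relate. The decoder produces a sample in a single pass, while the score-based sampler needs many iterative updates.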