Generative Artificial Intelligence in Robotic Manipulation: A Survey

Kun Zhang, Peng Yun, Jun Cen, Junhao Cai, Didi Zhu, Hangjie Yuan, Chao Zhao, Tao Feng, Michael Yu Wang, Qifeng Chen, Jia Pan, Wei Zhang, Bo Yang, Hua Chen

TL;DR

The survey addresses data efficiency, long-horizon planning, and cross-environment generalization in robotic manipulation by systematically reviewing how generative learning models can generate data, model world dynamics, and synthesize policies. It introduces a three-layer taxonomy—Foundation (data/reward generation), Intermediate (language/code/visual/state generation), and Policy (grasp/trajectory generation)—to organize a broad spectrum of methods including GANs, VAEs, diffusion models, probabilistic flows, and autoregressive models. The work compiles representative methods and discusses challenges such as data scarcity, sim-to-real transfer, benchmark fragmentation, and physical-law awareness, while outlining concrete directions like domain grounding, unified benchmarks, and physics-informed learning. The practical impact lies in guiding researchers toward scalable data pipelines, multi-modal policy learning, and robust, generalizable robotic manipulation systems across real-world environments.

Abstract

This survey provides a comprehensive review of recent advances in generative learning models for robotic manipulation, addressing key challenges in the field. Robotic manipulation faces critical bottlenecks: insufficient data and inefficient data acquisition, long-horizon and complex task planning, and the multi-modal reasoning needed for robust policy learning across diverse environments. To tackle these challenges, this survey introduces several generative model paradigms, including Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), diffusion models, probabilistic flow models, and autoregressive models, highlighting their strengths and limitations. The applications of these models are categorized into three hierarchical layers: the Foundation Layer, focusing on data generation and reward generation; the Intermediate Layer, covering language, code, visual, and state generation; and the Policy Layer, emphasizing grasp generation and trajectory generation. Each layer is explored in detail, along with notable works that have advanced the state of the art. Finally, the survey outlines future research directions and challenges, emphasizing the need for more efficient data utilization, better handling of long-horizon tasks, and enhanced generalization across diverse robotic scenarios. All related resources, including research papers, open-source data, and projects, are collected for the community at https://github.com/GAI4Manipulation/AwesomeGAIManipulation

Paper Structure

This paper contains 25 sections, 7 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Overview of this survey. Versatile generative models in robotic manipulation.
  • Figure 2: Overview of generative models.
  • Figure 3: Overview of generative models for robotic manipulation. The taxonomy shows three main categories, each with detailed sub-categories describing specific approaches and methodologies in robotic manipulation.
  • Figure 4: Overview of natural language generative methods for manipulation. Natural language generation could be used for task decomposition, and external memory could be introduced to save and retrieve pre-defined skills. Physically-grounded language generation considers the available operations in the current space. Vision-language-action models generate the action along with the language.
  • Figure 5: Overview of code generation. Research in robotic code generation can be broadly categorized into three approaches: direct code generation, decomposition-based code generation, and constraint-based code generation.
  • ...and 3 more figures