A Survey of Embodied Learning for Object-Centric Robotic Manipulation

Ying Zheng; Lei Yao; Yuejiao Su; Yi Zhang; Yi Wang; Sicheng Zhao; Yiyi Zhang; Lap-Pui Chau

TL;DR

This survey addresses the problem of enabling robots to manipulate objects through embodied learning by organizing existing work into three interconnected domains: embodied perceptual learning, embodied policy learning, and embodied task-oriented learning. It provides a structured taxonomy of data representations (image-based, 3D-aware, and tactile), object pose estimation (instance-level, category-level, and novel-object, abbreviated ILOPE, CLOPE, and NOPE), and affordance learning. It then surveys policy representations (explicit, implicit, diffusion) and policy learning (reinforcement learning, imitation learning, and hybrids) before detailing object grasping and manipulation tasks, datasets, and evaluation metrics. The paper also covers applications in industrial, agricultural, domestic, and surgical domains, discusses challenges such as sim-to-real generalization, multimodal embodied LLMs, human-robot collaboration, model compression, and safety, and outlines future directions. Overall, the work consolidates recent developments, highlights practical datasets and benchmarks, and provides a roadmap toward robust, generalizable embodied robotic manipulation. A linked repository at https://github.com/RayYoh/OCRM_survey accompanies the survey.

Abstract

Embodied learning for object-centric robotic manipulation is a rapidly developing and challenging area in embodied AI. It is crucial for advancing next-generation intelligent robots and has garnered significant interest recently. Unlike data-driven machine learning methods, embodied learning focuses on robot learning through physical interaction with the environment and perceptual feedback, making it especially suitable for robotic manipulation. In this paper, we provide a comprehensive survey of the latest advancements in this field and categorize the existing work into three main branches: 1) Embodied perceptual learning, which aims to predict object pose and affordance through various data representations; 2) Embodied policy learning, which focuses on generating optimal robotic decisions using methods such as reinforcement learning and imitation learning; 3) Embodied task-oriented learning, designed to optimize the robot's performance based on the characteristics of different tasks in object grasping and manipulation. In addition, we offer an overview and discussion of public datasets, evaluation metrics, representative applications, current challenges, and potential future research directions. A project associated with this survey has been established at https://github.com/RayYoh/OCRM_survey.

Paper Structure

This paper contains 59 sections, 10 equations, 7 figures, 5 tables.

Introduction
Comparison with Recent Surveys
Text Organization
Embodied Perceptual Learning
Data Representation
Image-Based Representation
3D-Aware Representation
Tactile-Based Representation
Discussion
Object Pose Estimation
Instance-Level Object Pose Estimation (ILOPE)
Category-Level Object Pose Estimation (CLOPE)
Novel Object Pose Estimation (NOPE)
Discussion
Affordance Learning
...and 44 more sections

Figures (7)

Figure 1: An illustration of a robotic manipulation system (left) and the typology of embodied learning methods for object-centric robotic manipulation (right). Embodied perceptual learning (EPEL) takes data from sensors such as cameras as input and enhances the understanding of objects and the environment through interaction. It serves as the basis for embodied policy learning (EPCL) and embodied task-oriented learning (ETOL). EPCL uses the perceptual information provided by EPEL to formulate action strategies for robotic arms and end-effectors such as grippers, thereby providing specific operational capabilities for ETOL. ETOL integrates EPEL and EPCL, learning to perform diverse tasks based on the characteristics of different objects. These three closely related learning processes work together to enable robots to accomplish complex tasks.
Figure 2: Conceptual comparison of four image-based representation frameworks. SISB: Single-Image Single-Branch; SIMB: Single-Image Multi-Branch; MISB: Multi-Image Single-Branch; MIMB: Multi-Image Multi-Branch.
Figure 3: Conceptual comparison of three 3D-aware representation frameworks. DR: Depth-based Representation; PR: Point cloud-based Representation; TR: Transition-based Representation.
Figure 4: Visualization of four representative affordance prediction examples from the dataset provided by li2024laso, including bag lift, bottle open, knife grasp, and faucet open. The affordance ground truth labels are highlighted in red.
Figure 5: Illustration of single-object grasping (top row) and multi-object grasping (bottom row). The examples are from the ARNOLD benchmark gong2023arnold and the Grasp'Em dataset li2024grasp, respectively.
...and 2 more figures
