Optimizing Grasping in Legged Robots: A Deep Learning Approach to Loco-Manipulation

Dilermando Almeida, Guilherme Lazzarini, Juliano Negri, Thiago H. Segreto, Ricardo V. Godoy, Marcelo Becker

TL;DR

This work tackles robust grasping for loco-manipulation in quadruped robots by building a sim-to-real pipeline. It generates a large synthetic RGB-D grasp dataset in Genesis, trains a U-Net–style CNN to produce pixel-wise grasp-quality heatmaps from multi-modal inputs, and validates the approach on a real Spot robot with a manipulable end-effector. The key contributions include synthetic data creation with per-pixel grasp labels, a multi-modal grasp predictor, and a complete deployment pipeline from perception to manipulation. The approach demonstrates scalable, autonomous object handling in unstructured environments, with implications for rescue, logistics, and domestic robotics.

Abstract

This paper presents a deep learning framework designed to enhance the grasping capabilities of quadrupeds equipped with arms, with a focus on improving precision and adaptability. Our approach centers on a sim-to-real methodology that minimizes reliance on physical data collection. We developed a pipeline within the Genesis simulation environment to generate a synthetic dataset of grasp attempts on common objects. By simulating thousands of interactions from various perspectives, we created pixel-wise annotated grasp-quality maps to serve as the ground truth for our model. This dataset was used to train a custom CNN with a U-Net-like architecture that processes multi-modal input from onboard RGB and depth cameras: RGB images, depth maps, segmentation masks, and surface normal maps. The trained model outputs a grasp-quality heatmap that identifies the optimal grasp point. We validated the complete framework on a four-legged robot. The system successfully executed a full loco-manipulation task: autonomously navigating to a target object, perceiving it with its sensors, predicting the optimal grasp pose using our model, and performing a precise grasp. This work demonstrates that combining simulated training with advanced sensing offers a scalable and effective solution for object handling.
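
The perception step described in the abstract can be illustrated with a minimal sketch. It makes two assumptions that the paper summary does not confirm: that the four modalities are concatenated into a single 8-channel input (3 RGB + 1 depth + 1 segmentation + 3 normals), and that the optimal grasp point is the highest-scoring heatmap pixel inside the segmentation mask. The function names `stack_inputs` and `select_grasp_pixel` are hypothetical, and the heatmap here is a hand-written array rather than real network output.

```python
import numpy as np

def stack_inputs(rgb, depth, seg, normals):
    """Concatenate the four modalities into one channels-first array.

    Assumed layout (not confirmed by the source): 3 RGB + 1 depth
    + 1 segmentation + 3 normal channels = 8 channels.
    """
    rgb = np.asarray(rgb, dtype=np.float32)                  # (H, W, 3)
    normals = np.asarray(normals, dtype=np.float32)          # (H, W, 3)
    depth = np.asarray(depth, dtype=np.float32)[..., None]   # (H, W, 1)
    seg = np.asarray(seg, dtype=np.float32)[..., None]       # (H, W, 1)
    stacked = np.concatenate([rgb, depth, seg, normals], axis=-1)
    return np.transpose(stacked, (2, 0, 1))                  # (8, H, W)

def select_grasp_pixel(heatmap, mask):
    """Return (row, col, score) of the best-scoring pixel inside the mask.

    Assumption: the optimal grasp point is the masked argmax of the heatmap.
    """
    heatmap = np.asarray(heatmap, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    if heatmap.shape != mask.shape:
        raise ValueError("heatmap and mask must have the same shape")
    if not mask.any():
        raise ValueError("mask contains no object pixels")
    # Background pixels are set to -inf so they can never win the argmax.
    masked = np.where(mask, heatmap, -np.inf)
    row, col = np.unravel_index(np.argmax(masked), masked.shape)
    return int(row), int(col), float(masked[row, col])

# Toy 3x3 example: the highest raw score (0.9) lies outside the object mask.
heatmap = np.array([[0.9, 0.2, 0.1],
                    [0.3, 0.8, 0.4],
                    [0.1, 0.5, 0.6]])
mask = np.array([[0, 1, 1],
                 [0, 1, 1],
                 [0, 0, 1]], dtype=bool)

x = stack_inputs(np.zeros((3, 3, 3)), np.ones((3, 3)), mask, np.zeros((3, 3, 3)))
best = select_grasp_pixel(heatmap, mask)
print(x.shape, best)
```

Restricting the argmax to the mask matches the dataset convention that pixels outside the segmentation mask carry no label, so a high score there would not correspond to a tested grasp.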

Paper Structure

This paper contains 10 sections, 3 equations, 8 figures.

Figures (8)

  • Figure 1: Illustration of the camera positioning process. The water bottle is shown at the origin, in blue, and the camera positions are shown as red dots located at y = 0.5 m, while their x and z coordinates vary between -0.5 m and 0.5 m.
  • Figure 2: Parallel grasping simulation performed in the Genesis World environment, illustrating five robotic grippers executing simultaneous grasping attempts on a geometric model of a water bottle. Each gripper is aligned with a specific pixel of the object's image, approaching along the surface normal at that point.
  • Figure 3: Representation of the dataset ground truth for mapping grasping points, generated from simulations in the Genesis World environment. Pixels encoded in green indicate successful grasping regions, and pixels in red indicate where grasping failed. Pixels outside the segmentation mask are considered indeterminate.
  • Figure 4: Model used to predict optimal grasping points. The inputs (normal map, depth, segmentation, and RGB image) (left) are processed by the CNN to generate a map of optimal grasping points (right), indicated by the light green region. The optimal point, indicated by the red dot in the output image, is determined by the network.
  • Figure 5: Visualization of input data extracted in the Genesis World: (a) RGB image, (b) segmentation mask, (c) depth map, (d) normal map, and (e) ground truth of viable pixels for grasping (white for success, gray for failure, and black for no data), together with (f) the corresponding neural network output for those inputs, indicating the probability of grasping success per pixel (100% for white, 0% for black).
  • ...and 3 more figures
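
The camera placement described in Figure 1 can be sketched as a grid of viewpoints on the plane y = 0.5 m, with x and z spanning -0.5 m to 0.5 m around the object at the origin. The caption does not state how many viewpoints were used or how they were spaced, so the regular grid and the `n_per_axis` parameter below are assumptions for illustration, and `camera_positions` is a hypothetical helper rather than part of the authors' pipeline.

```python
import numpy as np

def camera_positions(n_per_axis=5, y=0.5, half_extent=0.5):
    """Return an (n_per_axis**2, 3) array of camera positions in metres.

    Cameras lie on the plane at height/offset `y`, with x and z sampled
    on a regular grid in [-half_extent, half_extent] (assumed spacing).
    """
    axis = np.linspace(-half_extent, half_extent, n_per_axis)
    xs, zs = np.meshgrid(axis, axis, indexing="ij")
    ys = np.full_like(xs, y)
    return np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)

positions = camera_positions()
print(positions.shape)   # one row per viewpoint, columns are (x, y, z)
print(positions[0], positions[-1])
```

Each row can then be used as a camera origin looking toward the object at (0, 0, 0) to render the RGB, depth, segmentation, and normal images for that viewpoint.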