Pixel2Catch: Multi-Agent Sim-to-Real Transfer for Agile Manipulation with a Single RGB Camera

Seongyong Kim, Junhyeon Cho, Kang-Won Lee, Soo-Chul Lim

TL;DR

To achieve stable learning in a high-DoF system composed of a robot arm equipped with a multi-fingered hand, this work designs a heterogeneous multi-agent reinforcement learning framework that defines the arm and hand as independent agents with distinct roles.

Abstract

To catch a thrown object, a robot must be able to perceive the object's motion and generate control actions in a timely manner. Rather than explicitly estimating the object's 3D position, this work focuses on a novel approach that recognizes object motion using pixel-level visual information extracted from a single RGB image. Such visual cues capture changes in the object's position and scale, allowing the policy to reason about the object's motion. Furthermore, to achieve stable learning in a high-DoF system composed of a robot arm equipped with a multi-fingered hand, we design a heterogeneous multi-agent reinforcement learning framework that defines the arm and hand as independent agents with distinct roles. Each agent is trained cooperatively using role-specific observations and rewards, and the learned policies are successfully transferred from simulation to the real world.

Paper Structure

This paper contains 21 sections, 4 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: We propose Pixel2Catch, an RGB-only robotic catching system without explicit 3D position estimation. The system consists of a robot arm equipped with a multi-fingered hand and a single RGB camera. Inspired by human visual perception, object motion is inferred from pixel-level features in image space rather than metric 3D coordinates. Policies trained in simulation are transferred directly to the real robot without fine-tuning.
  • Figure 2: Pipeline of the system and experimental setup. Each policy ($\pi_{arm}, \pi_{hand}$) operates on selected observations from two consecutive timesteps. Privileged information is used only during value network training. A single RGB camera is mounted 0.5 m behind and 2.2 m above the robot. The arm and hand are controlled by separate policies that are trained collaboratively to catch a thrown object.
  • Figure 3: (a) Objects used for training (top) and validation (middle) in simulation, and for real-world experiments (bottom). (b) Random object trajectories generated in simulation, shown without robot motion to highlight object dynamics.
  • Figure 4: Visualization of pixel-level features in simulation and real-world environments. A bounding box is generated around the object in the RGB image, from which the corner and center coordinates, as well as width and height, are extracted. The final input features include these values and their temporal differences.
  • Figure 5: Tracking and success rates over training. Results are averaged over 3 seeds, and the shaded regions indicate the standard deviation.
  • ...and 1 more figure
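
The pixel-level features described in the Figure 4 caption can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `(x_min, y_min, x_max, y_max)` box format, the use of two opposite corners rather than all four, the feature ordering, and the absence of any normalization are all assumptions made for illustration. The sketch takes bounding boxes from two consecutive frames and returns the current corner, center, width, and height values followed by their temporal differences.

```python
from typing import List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def box_features(box: Box) -> List[float]:
    """Corner coordinates, center coordinates, width, and height of one box."""
    x_min, y_min, x_max, y_max = box
    width = x_max - x_min
    height = y_max - y_min
    center_x = x_min + width / 2.0
    center_y = y_min + height / 2.0
    return [x_min, y_min, x_max, y_max, center_x, center_y, width, height]


def pixel_observation(prev_box: Box, curr_box: Box) -> List[float]:
    """Current box features followed by their frame-to-frame differences."""
    prev = box_features(prev_box)
    curr = box_features(curr_box)
    # Temporal differences: position change reflects motion in the image plane,
    # width/height change reflects the object growing or shrinking in view.
    delta = [c - p for c, p in zip(curr, prev)]
    return curr + delta
```

Calling `pixel_observation((100, 50, 120, 70), (90, 60, 130, 100))` yields a 16-value vector. In this example the width and height both grow by 20 pixels between frames, which is the kind of scale cue the abstract says lets the policy reason about object motion without metric 3D coordinates.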