RoboPanoptes: The All-seeing Robot with Whole-body Dexterity

Xiaomeng Xu, Dominik Bauer, Shuran Song

TL;DR

RoboPanoptes tackles the limits of end-effector-centric manipulation by introducing whole-body dexterity powered by whole-body vision. It combines a modular, scalable hardware design with 21 cameras distributed over the body and a whole-body visuomotor policy based on diffusion transformers and cross-attention to learn manipulation skills from demonstrations. Key innovations include view-dependent positional encoding, blink training for sensor robustness, and a leader-follower teleoperation interface to collect diverse data. Empirical results across unboxing, sweeping, and stowing tasks show RoboPanoptes outperforms baselines in accuracy, efficiency, and resilience, suggesting strong practical potential for dexterous manipulation in cluttered or constrained environments.

Abstract

We present RoboPanoptes, a capable yet practical robot system that achieves whole-body dexterity through whole-body vision. Its whole-body dexterity allows the robot to utilize its entire body surface for manipulation, such as leveraging multiple contact points or navigating constrained spaces. Meanwhile, whole-body vision uses a camera system distributed over the robot's surface to provide comprehensive, multi-perspective visual feedback of its own and the environment's state. At its core, RoboPanoptes uses a whole-body visuomotor policy that learns complex manipulation skills directly from human demonstrations, efficiently aggregating information from the distributed cameras while maintaining resilience to sensor failures. Together, these design aspects unlock new capabilities and tasks, allowing RoboPanoptes to unbox in narrow spaces, sweep multiple or oversized objects, and succeed in multi-step stowing in cluttered environments, outperforming baselines in adaptability and efficiency. Results are best viewed on https://robopanoptes.github.io.
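The resilience to sensor failures mentioned above is attributed in the TL;DR to blink training. The summary does not spell out the procedure, so the following is only a minimal sketch under the assumption that it amounts to randomly dropping whole camera tokens during training. The names `blink_mask` and `apply_blink`, the drop probability, and the choice to zero out dropped tokens are all hypothetical.

```python
import random

def blink_mask(num_cameras, drop_prob, rng):
    """Return one boolean per camera: True means the camera is kept."""
    keep = [rng.random() >= drop_prob for _ in range(num_cameras)]
    # Assumption: always keep at least one camera so the policy is never blind.
    if not any(keep):
        keep[rng.randrange(num_cameras)] = True
    return keep

def apply_blink(tokens, keep):
    """Replace the token of every dropped camera with zeros (assumed masking)."""
    return [tok if k else [0.0] * len(tok) for tok, k in zip(tokens, keep)]

rng = random.Random(0)                           # seeded for reproducibility
tokens = [[float(i)] * 4 for i in range(1, 22)]  # 21 toy camera tokens
keep = blink_mask(21, 0.3, rng)                  # 0.3 is an arbitrary rate
blinked = apply_blink(tokens, keep)
```

Sampling a fresh mask for every training example would expose the policy to many different subsets of working cameras, which is one plausible way to obtain robustness when cameras fail at test time.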

Paper Structure (16 sections, 8 figures)

This paper contains 16 sections, 8 figures.

Figures (8)

  • Figure 1: RoboPanoptes, a robot that utilizes all of its body parts to sense and interact with its environment. Whole-body vision (via 21 cameras distributed over the robot's body) enables whole-body dexterity, with the robot utilizing its entire surface for manipulation. This design enables new robot capabilities such as a) simultaneously sweeping multiple small objects, b) moving large objects using whole-body contact, c) unboxing in constrained, narrow spaces, and d) executing precise multi-step stowing in cluttered environments.
  • Figure 2: Modular Hardware Design including a) a body module consisting of an actuator, two cameras, and wire fixtures, as well as b) a head module with five cameras and an LED light.
  • Figure 3: Data Collection Interface. The operator uses both hands to control the leader robot, whose joint angles are sent to the follower robot in real time as position targets. The joint angles of the leader robot are recorded as target actions, while the images and joint angles of the follower robot are recorded as observations.
  • Figure 4: Whole-body Visuomotor Policy leverages whole-body vision for whole-body dexterity. Left: The current robot and environment state is observed via RoboPanoptes' 21 cameras and its 9 joint angles, which are converted to 21 camera poses using forward kinematics. Each image (green) is represented by the class token of a vision foundation model. Each camera pose (purple) is embedded using our view-dependent positional encoding. Concatenating each camera's image and pose tokens yields one whole-body vision token per camera, 21 in total. Middle: Our whole-body visuomotor policy consumes these vision tokens, together with proprioception and denoising-step tokens, as conditioning via cross-attention. We diffuse $T$ whole-body dexterity tokens (blue), each corresponding to an action time step. Right: Per time step, we project the predicted dexterity token to the 9 target joint angles that make up the dexterity action.
  • Figure 5: Alternative Observation Spaces. We compare RoboPanoptes with several baselines, including using a) only the head camera, b) the four neck cameras, and c) a top-down camera. Variants using all of RoboPanoptes' cameras but without view-dependent positional encoding or without blink training serve as ablations of our design.
  • ...and 3 more figures
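
The token construction described in the Figure 4 caption (joint angles to camera poses via forward kinematics, then one image-plus-pose token per camera) can be sketched as follows. This is an illustration under strong assumptions, not the authors' implementation: the kinematics are simplified to a planar chain, `LINK_LENGTH` and the `camera_to_link` mounting are hypothetical, and a plain sinusoidal encoding stands in for the view-dependent positional encoding, whose exact form is not given here.

```python
import math

NUM_JOINTS = 9     # from the Figure 4 caption
NUM_CAMERAS = 21   # from the Figure 4 caption
LINK_LENGTH = 0.1  # hypothetical link length in metres

def planar_link_poses(joint_angles, link_length=LINK_LENGTH):
    """Planar forward kinematics: (x, y, heading) at the end of each link."""
    x = y = heading = 0.0
    poses = []
    for angle in joint_angles:
        heading += angle  # joint angles accumulate along the chain
        x += link_length * math.cos(heading)
        y += link_length * math.sin(heading)
        poses.append((x, y, heading))
    return poses

def camera_poses(joint_angles, camera_to_link):
    """Assign each camera the pose of the link it is (assumed to be) mounted on."""
    link_poses = planar_link_poses(joint_angles)
    return [link_poses[link] for link in camera_to_link]

def encode_pose(pose, num_freqs=2):
    """Sinusoidal stand-in for the view-dependent positional encoding."""
    feats = []
    for value in pose:
        for k in range(num_freqs):
            feats.append(math.sin((2 ** k) * value))
            feats.append(math.cos((2 ** k) * value))
    return feats

def vision_tokens(image_tokens, poses):
    """Concatenate each camera's image token with its encoded pose."""
    return [img + encode_pose(p) for img, p in zip(image_tokens, poses)]

joint_angles = [0.0] * NUM_JOINTS
# Hypothetical mounting: two cameras on each of links 0-7, five on link 8.
camera_to_link = [i // 2 for i in range(16)] + [8] * 5
# Toy 4-dimensional image tokens in place of a vision model's class tokens.
image_tokens = [[0.0] * 4 for _ in range(NUM_CAMERAS)]
poses = camera_poses(joint_angles, camera_to_link)
tokens = vision_tokens(image_tokens, poses)
```

Because each camera pose is recomputed from the current joint angles, the pose part of a token changes as the robot moves even when the image content does not, which is what lets the policy relate each view to where it was taken from.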