OCRA: Object-Centric Learning with 3D and Tactile Priors for Human-to-Robot Action Transfer

Kuanning Wang, Ke Fan, Yuqian Fu, Siyu Lin, Hu Luo, Daniel Seita, Yanwei Fu, Yu-Gang Jiang, Xiangyang Xue

Abstract

We present OCRA, an Object-Centric framework for video-based human-to-Robot Action transfer that learns directly from human demonstration videos to enable robust manipulation. Object-centric learning emphasizes task-relevant objects and their interactions while filtering out irrelevant background, providing a natural and scalable way to teach robots. OCRA leverages multi-view RGB videos, the state-of-the-art 3D foundation model VGGT, and advanced detection and segmentation models to reconstruct object-centric 3D point clouds, capturing rich interactions between objects. To handle properties not easily perceived by vision alone, we incorporate tactile priors via a large-scale dataset of over one million tactile images. These 3D and tactile priors are fused through a multimodal module (ResFiLM) and fed into a Diffusion Policy to generate robust manipulation actions. Extensive experiments on both vision-only and visuo-tactile tasks show that OCRA significantly outperforms existing baselines and ablations, demonstrating its effectiveness for learning from human demonstration videos.
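
The abstract names ResFiLM as the fusion module but does not define it here. As a rough illustration only, the sketch below assumes the name refers to a residual variant of feature-wise linear modulation (FiLM), in which one modality predicts a per-channel scale and shift for the other and the result is added back to the unmodulated features. The choice of tactile features as the conditioning signal, the function names, and the weight layout are all assumptions, not details taken from the paper.

```python
from typing import List, Sequence


def linear(weights: Sequence[Sequence[float]],
           bias: Sequence[float],
           x: Sequence[float]) -> List[float]:
    """Plain affine map y = W x + b on Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def residual_film(geom: Sequence[float],
                  tactile: Sequence[float],
                  w_gamma: Sequence[Sequence[float]],
                  b_gamma: Sequence[float],
                  w_beta: Sequence[Sequence[float]],
                  b_beta: Sequence[float]) -> List[float]:
    """Hypothetical residual FiLM: geom + (gamma(tactile) * geom + beta(tactile)).

    ASSUMPTION: tactile features condition the geometric features; the
    paper may use a different direction or architecture.
    """
    gamma = linear(w_gamma, b_gamma, tactile)  # per-channel scale
    beta = linear(w_beta, b_beta, tactile)     # per-channel shift
    return [g + (ga * g + be) for g, ga, be in zip(geom, gamma, beta)]


geom_feat = [1.0, 2.0]   # toy geometric (3D) feature
tactile_feat = [0.5]     # toy tactile feature

# With all-zero modulation weights the residual path leaves geom unchanged.
zeros_w, zeros_b = [[0.0], [0.0]], [0.0, 0.0]
identity_out = residual_film(geom_feat, tactile_feat,
                             zeros_w, zeros_b, zeros_w, zeros_b)

# Non-zero weights: gamma = [1.0, 0.0], beta = [0.0, 0.5].
fused = residual_film(geom_feat, tactile_feat,
                      [[2.0], [0.0]], [0.0, 0.0],
                      [[0.0], [1.0]], [0.0, 0.0])
```

One property of a residual formulation like this is that zero-initialized modulation layers leave the geometric features unchanged, so in a vision-only setting the fusion step can reduce to an identity map.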

Paper Structure

This paper contains 21 sections, 3 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: Overview of OCRA. OCRA leverages object-centric learning from multi-view human demonstration videos and tactile sensing to enable robust execution of diverse manipulation tasks transferred from humans to robots.
  • Figure 2: Framework. The left column illustrates our human demonstration collection system. Two RGB cameras capture demonstration videos, while the blue box highlights a portable tactile gripper for tactile data collection, which also records fingertip tactile images used to build our large-scale tactile dataset (shown at the bottom). The first row depicts how OCRA processes multi-view RGB inputs to obtain object-centric 3D priors. We first reconstruct the 3D scene using VGGT, followed by bi-view metric depth prediction for world-scale alignment. GroundingDINO and SAM2 then provide object segmentation masks, divided into a Manipulable Object Mask (for target objects) and a Context Object Mask (for surrounding objects). These are used to extract visual object-centric representations across modalities (segmentation, point cloud). The middle of the second row shows tactile-prior extraction via Tactile Encoder pretraining under a Masked Autoencoder paradigm. The right of the second row presents policy deployment. Multi-view RGB and tactile images are encoded into geometric and tactile features, which are fused by ResFiLM and passed to a Diffusion Policy. The policy predicts actions through iterative denoising of noisy action samples.
  • Figure 3: Experiments Visualization. The first row shows a human demonstrator using either a hand or a portable gripper to collect demonstrations. The left four tasks are vision-only, while the right three are visuo-tactile. The following rows show the robot’s execution of our policy. The last two columns illustrate visuo-tactile tasks under different input conditions: in the Robot Deployment subfigures, the left and right insets correspond to different execution attempts, with arrows of different colors indicating motion directions. In the Weight Sorting task, cups are guided to distinct target locations based on mass; in the Texture Sorting task, objects are sorted by surface texture. For the Texture Sorting column (column 7, inset 3), we additionally show tactile images from grasps on different textures. These results demonstrate that vision alone is insufficient for reliable discrimination, highlighting the critical role of tactile perception in accurate decision-making.
  • Figure 4: Experimental setup for our system. On the right, the top half shows the robot and the portable gripper, which use the same tactile device, and the bottom half shows our experimental objects. The camera setup is shown in Fig. 2. On the left, the image shows a subset of the objects we used to collect our large-scale tactile image dataset.
  • Figure 5: Visualization of camera views. We use the two cameras marked by green boxes during both demonstration collection and testing. To evaluate the view generalization ability of OCRA, the left green-boxed camera remains fixed, while the camera marked by both green and yellow boxes is repositioned according to the arrow direction.
  • ...and 1 more figure
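
The Figure 2 caption describes using a Manipulable Object Mask and a Context Object Mask to extract object-centric point clouds. The sketch below shows one common way such a step can be done: back-projecting only the masked pixels of a metric depth map through a pinhole camera model. It is a minimal illustration under stated assumptions, not the paper's pipeline: OCRA obtains its geometry from VGGT, and the function name, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), and the toy arrays here are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject_masked(depth: Sequence[Sequence[float]],
                       mask: Sequence[Sequence[int]],
                       fx: float, fy: float,
                       cx: float, cy: float) -> List[Point]:
    """Back-project masked pixels of a metric depth map to camera-frame points.

    ASSUMPTION: an ideal pinhole camera; depth values are metres along the
    optical axis, and non-positive depths are treated as invalid.
    """
    points: List[Point] = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, keep) in enumerate(zip(depth_row, mask_row)):
            if not keep or z <= 0.0:
                continue  # skip background pixels and invalid depth
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x3 depth map and two binary masks (hypothetical values).
depth = [[1.0, 1.0, 2.0],
         [1.0, 2.0, 2.0]]
manipulable_mask = [[0, 1, 0],
                    [0, 1, 1]]
context_mask = [[1, 0, 0],
                [0, 0, 0]]

manipulable_points = backproject_masked(depth, manipulable_mask,
                                        fx=1.0, fy=1.0, cx=1.0, cy=0.0)
context_points = backproject_masked(depth, context_mask,
                                    fx=1.0, fy=1.0, cx=1.0, cy=0.0)
```

Keeping the two masks separate yields two point sets, so a downstream encoder could treat target objects and surrounding objects differently; background pixels never enter either cloud.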