RoCap: A Robotic Data Collection Pipeline for the Pose Estimation of Appearance-Changing Objects
Jiahao Nick Li, Toby Chong, Zhongyi Zhou, Hironori Yoshida, Koji Yatani, Xiang 'Anthony' Chen, Takeo Igarashi
TL;DR
RoCap addresses 6D pose estimation for appearance-changing objects by introducing a robotic data-collection pipeline. It uses a 6-DoF robot arm and eye-to-hand calibration to generate ground-truth poses while capturing the object in diverse configurations, covering deformable, transparent, reflective, and articulated objects. The approach combines SAM-based masking, pose-aware data collection, and a simple CNN-based orientation model to demonstrate feasibility, and it compares the results against a synthetic-data baseline such as Gen6D, reporting both improvements and current limitations. The work shows that automated, scalable data collection is practical for challenging objects, and it offers labeled datasets and a replicable pipeline that may benefit MR, robotics, and object-tracking research.
Abstract
Object pose estimation plays a vital role in mixed-reality interactions when users manipulate tangible objects as controllers. Traditional vision-based object pose estimation methods leverage 3D reconstruction to synthesize training data. However, these methods are designed for static objects with diffuse colors and do not work well for objects that change their appearance during manipulation, such as deformable objects like plush toys, transparent objects like chemical flasks, reflective objects like metal pitchers, and articulated objects like scissors. To address this limitation, we propose RoCap, a robotic pipeline that emulates human manipulation of target objects while generating data labeled with ground-truth pose information. The user first gives the target object to a robotic arm, and the system captures many pictures of the object in various 6D configurations. The system then trains a model using the captured images and their ground-truth poses, which are calculated automatically from the joint angles of the robotic arm. We showcase pose estimation for appearance-changing objects by training simple deep-learning models on the collected data and comparing the results, through quantitative and qualitative evaluation, against those of a model trained on synthetic data based on 3D reconstruction. The findings underscore the promising capabilities of RoCap.
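
To make the labeling step concrete, the following minimal sketch shows one way a ground-truth pose could be derived from the arm's state by chaining homogeneous transforms. It is an illustration under stated assumptions, not the authors' implementation: the camera is assumed to be fixed (eye-to-hand), `T_cam_base` stands in for the calibration result, `T_base_ee` stands in for the forward-kinematics output computed from the joint angles, and `T_ee_obj` stands in for the object's pose in the gripper. All function names and numeric values are hypothetical.

```python
import numpy as np


def make_transform(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T


def rot_z(theta):
    """Rotation matrix for an angle theta (radians) about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def object_pose_in_camera(T_cam_base, T_base_ee, T_ee_obj):
    """Chain camera<-base, base<-end-effector, and end-effector<-object."""
    return T_cam_base @ T_base_ee @ T_ee_obj


# Hypothetical values, for illustration only.
# Assumed calibration result: maps robot-base coordinates into the camera frame.
T_cam_base = make_transform(rot_z(np.pi / 2), [0.0, 0.0, 1.0])
# Assumed forward-kinematics output for the current joint angles.
T_base_ee = make_transform(rot_z(np.pi / 4), [0.3, 0.0, 0.5])
# Assumed fixed offset of the object relative to the gripper.
T_ee_obj = make_transform(np.eye(3), [0.0, 0.0, 0.1])

T_cam_obj = object_pose_in_camera(T_cam_base, T_base_ee, T_ee_obj)
R_label = T_cam_obj[:3, :3]  # orientation label for the captured image
t_label = T_cam_obj[:3, 3]   # position label for the captured image
```

In this sketch each captured image would be paired with `R_label` and `t_label`. For deformable or articulated objects the offset `T_ee_obj` only describes the grasped part, which is one reason the assumption of a rigid, known offset should be treated with care.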
