RoCap: A Robotic Data Collection Pipeline for the Pose Estimation of Appearance-Changing Objects

Jiahao Nick Li, Toby Chong, Zhongyi Zhou, Hironori Yoshida, Koji Yatani, Xiang 'Anthony' Chen, Takeo Igarashi

TL;DR

RoCap addresses the challenge of 6D pose estimation for appearance-changing objects by introducing a robotic data-collection pipeline. It leverages a 6-DoF robot arm and eye-to-hand calibration to generate ground-truth poses while capturing diverse object configurations, including deformable, transparent, reflective, and articulated states. The approach uses SAM-based masking, pose-aware data collection, and a simple CNN-based orientation model to demonstrate feasibility and to compare with a synthetic-data baseline like Gen6D, highlighting improvements and current limitations. The work demonstrates the practicality of automated, scalable data collection for challenging objects, with potential impact on MR, robotics, and object-tracking research by providing labeled datasets and a replicable pipeline for real-world use.

Abstract

Object pose estimation plays a vital role in mixed-reality interactions when users manipulate tangible objects as controllers. Traditional vision-based object pose estimation methods leverage 3D reconstruction to synthesize training data. However, these methods are designed for static objects with diffuse colors and do not work well for objects that change their appearance during manipulation, such as deformable objects like plush toys, transparent objects like chemical flasks, reflective objects like metal pitchers, and articulated objects like scissors. To address this limitation, we propose RoCap, a robotic pipeline that emulates human manipulation of target objects while generating data labeled with ground-truth pose information. The user first hands the target object to a robotic arm, and the system captures many images of the object in various 6D configurations. The system then trains a model on the captured images and their ground-truth poses, which are computed automatically from the joint angles of the robotic arm. We showcase pose estimation for appearance-changing objects by training simple deep-learning models on the collected data and comparing them, quantitatively and qualitatively, with a model trained on synthetic data based on 3D reconstruction. The findings underscore the promising capabilities of RoCap.
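
As a rough illustration of how a ground-truth pose might be derived from the arm's joint angles, the sketch below chains three rigid transforms: camera-from-base (from eye-to-hand calibration), base-from-end-effector (from forward kinematics), and end-effector-from-object (the grasp offset). This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the frame conventions, and all numeric values are hypothetical, and rotations are restricted to the z-axis to keep the example short.

```python
import math

def matmul4(a, b):
    """Multiply two 4x4 homogeneous transforms given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def rot_z(theta, tx=0.0, ty=0.0, tz=0.0):
    """Homogeneous transform [R | t]: rotation about z by theta (radians)
    with translation (tx, ty, tz)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0, tx],
            [s, c, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]

def object_pose_in_camera(T_cam_base, T_base_ee, T_ee_obj):
    """Compose camera<-base, base<-end-effector, and end-effector<-object."""
    return matmul4(matmul4(T_cam_base, T_base_ee), T_ee_obj)

# Hypothetical values, for illustration only.
T_cam_base = rot_z(0.0, tz=1.0)          # assumed result of eye-to-hand calibration
T_base_ee = rot_z(math.pi / 2, tx=0.5)   # assumed forward-kinematics output
T_ee_obj = rot_z(0.0, tz=0.1)            # assumed fixed grasp offset

T_cam_obj = object_pose_in_camera(T_cam_base, T_base_ee, T_ee_obj)
```

Here `T_cam_obj` holds the object's rotation in its upper-left 3x3 block and its translation in the last column, which is the kind of label that could be attached to each captured image.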

Paper Structure (34 sections, 2 equations, 10 figures, 1 table)

Figures (10)

  • Figure 1: 3D reconstructed results for a transparent flask.
  • Figure 2: Example objects for each category that RoCap focuses on. Viewing-angle dependent: (1) flask, (2) water bottle, and (3) pitcher. Deformable: (4) flexible frog and (5) stiff Anpanman. Articulated: (6) scissors, (7) spray head, and (8) clamp.
  • Figure 3: Overview of RoCap. The RoCap pipeline consists of camera calibration, data capturing, data labeling, data processing, and data augmentation. By training with an existing deep-learning framework, RoCap achieves object segmentation, state classification, and pose estimation for appearance-changing objects.
  • Figure 4: Illustration of the eye-to-hand camera calibration (a). The robotic arm grips a checkerboard and moves it to multiple positions and orientations for an accurate calibration (b).
  • Figure 5: Pose coverage in the RoCap capturing pipeline.
  • ...and 5 more figures