
Object and Contact Point Tracking in Demonstrations Using 3D Gaussian Splatting

Michael Büttner, Jonathan Francis, Helge Rhodin, Andrew Melnik

TL;DR

This paper introduces a method to enhance Interactive Imitation Learning by extracting touch interaction points and tracking object movement from video demonstrations, which lays the foundation for more effective task learning and execution in autonomous robotic systems.

Abstract

This paper introduces a method to enhance Interactive Imitation Learning (IIL) by extracting touch interaction points and tracking object movement from video demonstrations. The approach extends current IIL systems by giving robots detailed knowledge of both where and how to interact with objects, particularly complex articulated ones such as doors and drawers. By using 3D Gaussian Splatting for scene reconstruction and FoundationPose for 6-DoF object tracking, the method helps robots better understand and manipulate objects in dynamic environments. The research lays the foundation for more effective task learning and execution in autonomous robotic systems.

Paper Structure

This paper contains 8 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: Overview of the pipeline. We start with RGB-D recordings of the scene and of the demonstration. We train a 3D Gaussian Splatting [kerbl3Dgaussians] scene on the scene video and mask the object in the demonstration video using RAFT [teed2020raft] and SAM 2 [ravi2024sam2]. These masks are used to create object masks for the scene video, which in turn are used to build a mesh with GS2Mesh [wolf2024gs2mesh] and a Gaussian object segmentation with SAGS [hu2024semantic]. The mesh is used for 6-DoF tracking with FoundationPose [foundationposewen2024], whose output is in turn used to estimate contact points. The mesh shown here is visualized with MeshLab [LocalChapterEvents:ItalChap:ItalianChapConf2008:129-136].
  • Figure 2: To estimate touch points, we calculate the absolute difference between the depth image of the LiDAR camera and the rendered depth image, threshold it, and apply the hand mask generated using Grounding DINO [liu2023grounding] + SAM 2 [ravi2024sam2]. This is done for the first 10 frames of contact, and the results are accumulated to find the most probable points.
  • Figure 3: Visuals of successful tracking and contact point estimation episodes. The red spheres mark the identified contact points, the green spheres the starting positions, and the blue spheres the end positions.
  • Figure A.1: Detailed overview of the process. Rectangles indicate data while circles indicate processes.
  • Figure A.2: To find the bounding box of the moving object, we run RAFT optical flow on all frames of the demonstration video with a stride of 6, remove the human from the optical flow mask, cluster the bounding boxes by size, and choose the frame whose bounding box is closest to the center of the biggest cluster. The human is filtered out using a mask generated by Grounding DINO [liu2023grounding] and SAM 2 [ravi2024sam2].
  • ...and 1 more figure
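The touch-point step described in the Figure 2 caption can be sketched as follows. This is a minimal sketch, not the authors' code. It assumes that the LiDAR depth maps, the rendered depth maps, and the Grounding DINO + SAM 2 hand masks are already available as aligned NumPy arrays (depth in metres), starting at the first frame of contact. It also assumes that a hand pixel counts as contact when its measured depth lies within a threshold of the rendered surface. The function name `estimate_contact_pixels`, the 1 cm threshold, and the "keep the most frequently flagged pixels" rule are hypothetical choices that the caption does not specify.

```python
import itertools

import numpy as np


def estimate_contact_pixels(sensor_depths, rendered_depths, hand_masks,
                            n_frames=10, threshold_m=0.01, min_votes=None):
    """Hypothetical sketch: accumulate per-pixel contact votes over the first
    `n_frames` frames of contact.

    sensor_depths, rendered_depths: iterables of (H, W) depth maps in metres.
    hand_masks: iterable of (H, W) boolean masks (True = hand pixel).
    threshold_m, min_votes: assumed parameters, not values from the paper.
    Returns (list of (row, col) contact pixels, per-pixel vote map).
    """
    votes = None
    frames = itertools.islice(zip(sensor_depths, rendered_depths, hand_masks), n_frames)
    for sensor, rendered, hand in frames:
        diff = np.abs(np.asarray(sensor, dtype=float) - np.asarray(rendered, dtype=float))
        # Assumption: a hand pixel whose measured depth nearly matches the
        # rendered surface is treated as touching the object.
        contact = (diff < threshold_m) & np.asarray(hand, dtype=bool)
        votes = contact.astype(int) if votes is None else votes + contact
    if votes is None:
        raise ValueError("no frames were provided")
    if votes.max() == 0:
        return [], votes
    # Default rule (assumed): keep only the pixels flagged most often.
    cutoff = votes.max() if min_votes is None else min_votes
    rows, cols = np.nonzero(votes >= cutoff)
    return list(zip(rows.tolist(), cols.tolist())), votes
```

Voting over several frames suppresses pixels that pass the threshold in only a single noisy depth frame. Lifting the selected pixels to 3D contact points would additionally require the camera intrinsics and the tracked object pose.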
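The bounding-box selection in the Figure A.2 caption can likewise be sketched. This sketch assumes the RAFT optical-flow fields (H×W×2 arrays, one per sampled frame at stride 6) and the human masks have already been computed; it does not call RAFT, Grounding DINO, or SAM 2. The caption does not name a clustering algorithm, so the greedy 1-D grouping of box areas below is a hypothetical stand-in, as are the helper names `moving_box`, `cluster_by_area`, `select_object_frame` and the `flow_thresh` and `rel_gap` parameters.

```python
import numpy as np


def moving_box(flow, human_mask, flow_thresh=1.0):
    """Bounding box (x0, y0, x1, y1) of moving non-human pixels, or None."""
    magnitude = np.linalg.norm(np.asarray(flow, dtype=float), axis=-1)
    # Remove the human from the motion mask before taking the box.
    moving = (magnitude > flow_thresh) & ~np.asarray(human_mask, dtype=bool)
    ys, xs = np.nonzero(moving)
    if xs.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1


def cluster_by_area(areas, rel_gap=0.2):
    """Hypothetical greedy 1-D clustering: sort by area and start a new
    cluster whenever the area grows by more than `rel_gap` (relative)."""
    order = sorted(range(len(areas)), key=lambda i: areas[i])
    clusters = [[order[0]]]
    for prev, cur in zip(order, order[1:]):
        if areas[cur] > areas[prev] * (1 + rel_gap):
            clusters.append([cur])
        else:
            clusters[-1].append(cur)
    return clusters


def select_object_frame(flows, human_masks, stride=6, flow_thresh=1.0, rel_gap=0.2):
    """Return (frame index, box) of the box closest to the biggest cluster's center."""
    boxes, frame_ids = [], []
    for i, (flow, mask) in enumerate(zip(flows, human_masks)):
        box = moving_box(flow, mask, flow_thresh)
        if box is not None:
            boxes.append(box)
            frame_ids.append(i * stride)  # flows are sampled every `stride` frames
    if not boxes:
        raise ValueError("no moving object found")
    areas = [(x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in boxes]
    biggest = max(cluster_by_area(areas, rel_gap), key=len)
    center = sum(areas[i] for i in biggest) / len(biggest)
    best = min(range(len(areas)), key=lambda i: abs(areas[i] - center))
    return frame_ids[best], boxes[best]
```

Choosing the frame nearest the center of the largest cluster presumably favors a typical view of the moving object over outlier frames whose box is truncated or inflated by residual motion.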