Persistent Object Gaussian Splat (POGS) for Tracking Human and Robot Manipulation of Irregularly Shaped Objects

Justin Yu, Kush Hari, Karim El-Refai, Arnav Dalal, Justin Kerr, Chung Min Kim, Richard Cheng, Muhammad Zubair Irshad, Ken Goldberg

TL;DR

POGS introduces a persistent object representation that updates online for unseen irregular objects using a combination of language-grounded, grouping, and self-supervised features embedded in a 3D Gaussian field. It is trained from a multi-view scene capture and then tracked with a single stereo camera, eliminating the need for CAD models or full re-scans. The approach supports open-vocabulary queries for grasping and manipulation and updates pose estimates as objects move, including under human perturbations and during tool servoing. Experiments show an average pose error of $2.92$ cm, up to $12$ consecutive successful object resets, and recovery from up to $80\%$ of in-grasp tool perturbations of up to $30^{\circ}$.

Abstract

Tracking and manipulating irregularly-shaped, previously unseen objects in dynamic environments is important for robotic applications in manufacturing, assembly, and logistics. Recently introduced Gaussian Splats efficiently model object geometry, but lack persistent state estimation for task-oriented manipulation. We present Persistent Object Gaussian Splat (POGS), a system that embeds semantics, self-supervised visual features, and object grouping features into a compact representation that can be continuously updated to estimate the pose of scanned objects. POGS updates object states without requiring expensive rescanning or prior CAD models of objects. After an initial multi-view scene capture and training phase, POGS uses a single stereo camera to integrate depth estimates along with self-supervised vision encoder features for object pose estimation. POGS supports grasping, reorientation, and natural language-driven manipulation by refining object pose estimates, facilitating sequential object reset operations with human-induced object perturbations and tool servoing, where robots recover tool pose despite tool perturbations of up to 30°. POGS achieves up to 12 consecutive successful object resets and recovers from 80% of in-grasp tool perturbations.
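
The figures quoted above involve both a positional error (in cm) and a rotational perturbation (in degrees). As a minimal sketch of how such quantities are commonly measured, the example below computes the Euclidean distance between two positions and the geodesic angle between two rotation matrices. This summary does not state the exact metric the authors use, so the function names (`translation_error`, `rotation_error_deg`, `rotation_z`) and the sample values are illustrative assumptions, not the paper's evaluation code.

```python
import math

def rotation_z(deg):
    """3x3 rotation about the z-axis by `deg` degrees (nested lists, row-major)."""
    t = math.radians(deg)
    c, s = math.cos(t), math.sin(t)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def translation_error(p_est, p_ref):
    """Euclidean distance between two 3D positions (same units as the inputs)."""
    return math.dist(p_est, p_ref)

def rotation_error_deg(R_est, R_ref):
    """Geodesic angle, in degrees, of the relative rotation R_ref^T R_est."""
    # trace(R_ref^T R_est) equals the sum of elementwise products.
    trace = sum(R_ref[i][j] * R_est[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against floating-point drift outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

# Hypothetical example: an estimate offset by a few centimetres and rotated
# 30 degrees about z relative to the reference pose.
pos_err_cm = 100.0 * translation_error((0.02, 0.02, 0.01), (0.0, 0.0, 0.0))
rot_err = rotation_error_deg(rotation_z(30.0), rotation_z(0.0))
print(f"translation error: {pos_err_cm:.2f} cm, rotation error: {rot_err:.1f} deg")
```

Separating the two errors keeps units unambiguous: a single scalar mixing metres and degrees would need an arbitrary weighting, whereas reporting them side by side matches how the abstract states its results.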

Paper Structure

This paper contains 17 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Autonomous Object Manipulation and Tracking with POGS Unified Representation (Top) A robot autonomously performs a pick and place primitive to move the shoe onto a shoebox given input natural language pick query "shoe" and place query "shoebox". (Bottom) A POGS unified representation enables language querying, grasp sampling, and continuous tracking of irregular objects as they move.
  • Figure 2: POGS Pipeline After capturing multiple images of a scene using a robot wrist-mounted ZED mini, POGS segments objects using Detic, extracts DINO features, and embeds language through CLIP. Training images are used to optimize a 3DGS, and features extracted from 2D foundation models are distilled into feature fields, producing our POGS unified representation. During robot object reset and tool servoing, the POGS is updated based on depth geometry and DINO tracking features.
  • Figure 3: Occluded Grasp Sampling POGS is capable of sampling and performing robot grasps on geometry that is fully occluded from the observation camera view (shown). The drill handle is fully occluded by the motor body, yet our POGS unified representation enables handle grasping based on previously observed geometry.
  • Figure 4: Object Reset Experimental Setup Middle: A human randomly perturbs the configuration of the tracked objects according to the two tiers. Right: A robot arm then plans a grasp on language-queried objects and performs object reset. This process repeats until errors in object state estimation are too high to recover for grasping.
  • Figure 5: Tool Servoing Experimental Setup The robot continuously attempts to align the tracked tool with the target. Top: A human perturbs the tracked tool while in the robot's gripper. The robot adjusts its end-effector position with closed-loop control to re-align the object with the target. Bottom: As a human shifts and rotates the target into new poses, the robot moves so the tool follows the target while maintaining alignment.