DeLTa: Demonstration and Language-Guided Novel Transparent Object Manipulation

Taeyeop Lee, Gyuree Kang, Bowen Wen, Youngho Kim, Seunghyeok Back, In So Kweon, David Hyunchul Shim, Kuk-Jin Yoon

TL;DR

The paper tackles robust transparent-object manipulation, where depth sensing is unreliable and long-horizon tasks demand precision. It introduces DeLTa, a framework that integrates stereo depth estimation, 6D pose estimation, and vision-language planning, guided by a single demonstration. Key contributions include 4D hand-object interaction modeling from human videos, a demonstration-based trajectory database, a VLM-based task planner with plan grounding, and a last-inch motion planner for safe, collision-aware execution. Real-world experiments show better performance on long-horizon tasks than strong baselines, suggesting practical value for human-robot collaboration.

Abstract

Despite the prevalence of transparent object interactions in everyday human life, transparent robotic manipulation research remains limited to short-horizon tasks and basic grasping capabilities. Although some methods have partially addressed these issues, most generalize poorly to novel objects and are insufficient for precise long-horizon robot manipulation. To address these limitations, we propose DeLTa (Demonstration and Language-Guided Novel Transparent Object Manipulation), a novel framework that integrates depth estimation, 6D pose estimation, and vision-language planning for precise long-horizon manipulation of transparent objects guided by natural task instructions. A key advantage of our method is its single-demonstration approach, which generalizes 6D trajectories to novel transparent objects without requiring category-level priors or additional training. Additionally, we present a task planner that refines the VLM-generated plan to account for the constraints of a single-arm, eye-in-hand robot for long-horizon object manipulation tasks. Through comprehensive evaluation, we demonstrate that our method significantly outperforms existing transparent object manipulation approaches, particularly in long-horizon scenarios requiring precise manipulation capabilities. Project page: https://sites.google.com/view/DeLTa25/

Paper Structure

This paper contains 15 sections, 5 equations, 7 figures, 2 tables, 1 algorithm.

Figures (7)

  • Figure 1: DeLTa for Transparent Object Manipulation.
  • Figure 2: Overview of our DeLTa framework.
  • Figure 3: Comparison of ZED camera's depth and reconstructed depth.
  • Figure 4: VLM prompting process. A human first provides the robot with a task description in natural language. The robot then formulates a templated prompt and queries the VLM for responses. Blue: context information including robot state, primitive actions, and environment state. Orange: task-dependent prompts. Red: error messages and invalid plans.
  • Figure 5: Last-Inch Motion Planner: Pouring (Left) and Pick (Right). RGB axes visualize planned end-effector poses. Blue boxes represent the approximated collision map derived from the point cloud.
  • ...and 2 more figures