Dex4D: Task-Agnostic Point Track Policy for Sim-to-Real Dexterous Manipulation

Yuxuan Kuang, Sungjae Park, Katerina Fragkiadaki, Shubham Tulsiani

TL;DR

Dex4D is a framework that enables zero-shot deployment on diverse dexterous manipulation tasks and yields consistent improvements over prior baselines. It also generalizes well to novel objects, scene layouts, backgrounds, and trajectories, which points to the robustness and scalability of the approach.

Abstract

Learning generalist policies capable of accomplishing a plethora of everyday tasks remains an open challenge in dexterous manipulation. In particular, collecting large-scale manipulation data via real-world teleoperation is expensive and difficult to scale. While learning in simulation provides a feasible alternative, designing multiple task-specific environments and rewards for training is similarly challenging. We propose Dex4D, a framework that instead leverages simulation for learning task-agnostic dexterous skills that can be flexibly recomposed to perform diverse real-world manipulation tasks. Specifically, Dex4D learns a domain-agnostic 3D point track conditioned policy capable of manipulating any object to any desired pose. We train this 'Anypose-to-Anypose' policy in simulation across thousands of objects with diverse pose configurations, covering a broad space of robot-object interactions that can be composed at test time. At deployment, this policy can be zero-shot transferred to real-world tasks without finetuning, simply by prompting it with desired object-centric point tracks extracted from generated videos. During execution, Dex4D uses online point tracking for closed-loop perception and control. Extensive experiments in simulation and on real robots show that our method enables zero-shot deployment for diverse dexterous manipulation tasks and yields consistent improvements over prior baselines. Furthermore, we demonstrate strong generalization to novel objects, scene layouts, backgrounds, and trajectories, highlighting the robustness and scalability of the proposed framework.

Paper Structure

This paper contains 35 sections, 7 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Overview of our Dex4D teacher and student network architectures. (a) We first learn a teacher policy via RL with privileged states and full points sampled on the whole object, leveraging our proposed Paired Point Encoding representation. (b) Given partial observation, i.e., robot proprioception, last action, and masked paired points, we distill from the teacher and learn a transformer-based student action world model that jointly predicts actions and future robot states.
  • Figure 2: Comparison of our Paired Point Encoding with other representations. Point features encoded with Paired Point Encoding preserve the correspondence between current and target object points while remaining permutation-invariant, which yields better performance for policy learning.
  • Figure 3: Mean reward curve of the first two stages of teacher training. Step 15k is the curriculum boundary. Our method outperforms both ablation variants.
  • Figure 4: Overview of real-world dexterous manipulation tasks. Two frames are shown in each column for each task.
  • Figure 5: Qualitative comparison between our method and the baseline. The baseline method suffers from object dropping and inaccurate post-grasping movement due to the lack of hand feedback and vulnerability to few and noisy visible points, while our method performs robustly.
  • ...and 2 more figures
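
The Figure 2 caption says Paired Point Encoding keeps correspondence between current and target points while staying permutation-invariant. The sketch below illustrates one way those two properties can coexist. It is not the paper's implementation: it assumes a PointNet-style design in which each current point is concatenated with its matching target point, a shared linear layer with ReLU is applied per pair, and the results are max-pooled. The function name `paired_point_encoding` and the `weights` and `bias` parameters are hypothetical.

```python
def paired_point_encoding(current_pts, target_pts, weights, bias):
    """Encode N (current, target) point pairs into one feature vector.

    Assumed design (not taken from the paper): a shared linear layer with
    ReLU is applied to each 6-D pair, followed by max pooling over pairs.
    """
    assert len(current_pts) == len(target_pts), "points must come in pairs"
    per_pair = []
    for cur, tgt in zip(current_pts, target_pts):
        pair = list(cur) + list(tgt)  # pairing keeps the correspondence
        feat = [
            max(0.0, sum(w * x for w, x in zip(row, pair)) + b)
            for row, b in zip(weights, bias)
        ]
        per_pair.append(feat)
    # Max over pairs does not depend on the order of the pairs.
    return [max(column) for column in zip(*per_pair)]


# Toy weights: the single output feature is relu(current_x - target_x).
weights = [[1.0, 0.0, 0.0, -1.0, 0.0, 0.0]]
bias = [0.0]
current = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]

same = paired_point_encoding(current, target, weights, bias)
reordered = paired_point_encoding(current[::-1], target[::-1], weights, bias)
mismatched = paired_point_encoding(current, target[::-1], weights, bias)
print(same, reordered, mismatched)
```

Reordering the pairs together leaves the feature unchanged, while shuffling only the target points changes it. Encoding the two point sets separately and pooling each on its own would lose this sensitivity to the pairing.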