SimToolReal: An Object-Centric Policy for Zero-Shot Dexterous Tool Manipulation

Kushal Kedia, Tyler Ga Wei Lum, Jeannette Bohg, C. Karen Liu

TL;DR

This work procedurally generates a large variety of tool-like object primitives in simulation and trains a single RL policy with the universal goal of manipulating each object to random goal poses, enabling SimToolReal to perform general dexterous tool manipulation at test time without any object- or task-specific training.

Abstract

The ability to manipulate tools significantly expands the set of tasks a robot can perform. Yet, tool manipulation represents a challenging class of dexterity, requiring grasping thin objects, in-hand object rotations, and forceful interactions. Since collecting teleoperation data for these behaviors is challenging, sim-to-real reinforcement learning (RL) is a promising alternative. However, prior approaches typically require substantial engineering effort to model objects and tune reward functions for each task. In this work, we propose SimToolReal, taking a step towards generalizing sim-to-real RL policies for tool manipulation. Instead of focusing on a single object and task, we procedurally generate a large variety of tool-like object primitives in simulation and train a single RL policy with the universal goal of manipulating each object to random goal poses. This approach enables SimToolReal to perform general dexterous tool manipulation at test time without any object- or task-specific training. We demonstrate that SimToolReal outperforms prior retargeting and fixed-grasp methods by 37% while matching the performance of specialist RL policies trained on specific target objects and tasks. Finally, we show that SimToolReal generalizes across a diverse set of everyday tools, achieving strong zero-shot performance over 120 real-world rollouts spanning 24 tasks, 12 object instances, and 6 tool categories.
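
The training setup described in the abstract (procedurally generated tool-like primitives, each paired with a random goal pose) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the handle-plus-head parameterization, the `ToolPrimitive` and `sample_goal_pose` names, and all numeric ranges are hypothetical and are not taken from the paper.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class ToolPrimitive:
    """Hypothetical handle-plus-head tool primitive (sizes in meters)."""
    handle_length: float
    handle_radius: float
    head_size: tuple  # (x, y, z) box extents of the tool head
    mass: float       # kilograms


def sample_tool(rng):
    # All ranges below are illustrative placeholders, not values from the paper.
    return ToolPrimitive(
        handle_length=rng.uniform(0.08, 0.30),
        handle_radius=rng.uniform(0.005, 0.02),
        head_size=tuple(rng.uniform(0.01, 0.10) for _ in range(3)),
        mass=rng.uniform(0.02, 0.50),
    )


def sample_unit_quaternion(rng):
    # Shoemake's method: uniformly distributed rotation as (x, y, z, w).
    u1, u2, u3 = rng.random(), rng.random(), rng.random()
    a, b = math.sqrt(1.0 - u1), math.sqrt(u1)
    return (
        a * math.sin(2.0 * math.pi * u2),
        a * math.cos(2.0 * math.pi * u2),
        b * math.sin(2.0 * math.pi * u3),
        b * math.cos(2.0 * math.pi * u3),
    )


def sample_goal_pose(rng, lo=(-0.2, -0.2, 0.1), hi=(0.2, 0.2, 0.5)):
    # Position is uniform inside an assumed workspace box; orientation is uniform.
    position = tuple(rng.uniform(l, h) for l, h in zip(lo, hi))
    return position, sample_unit_quaternion(rng)


if __name__ == "__main__":
    rng = random.Random(0)
    tool = sample_tool(rng)
    goal_position, goal_quaternion = sample_goal_pose(rng)
    print(tool)
    print(goal_position, goal_quaternion)
```

In a goal-conditioned setup of this kind, each training episode would draw a fresh primitive and goal pose, so the policy never sees a fixed object or task. How the paper actually parameterizes its primitives and goal distribution is not stated in this section.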

Paper Structure

This paper contains 33 sections, 13 equations, 16 figures, and 3 tables.

Figures (16)

  • Figure 1: SimToolReal is a framework for training a single general-purpose, object-centric RL policy in simulation and transferring it to real-world tool use. (Top) Zero-shot deployment on novel real tools and tasks, spanning thin markers to thick hammers. (Bottom) Tool use typically involves grasping objects from flat surfaces, reorienting them in-hand, and performing the task.
  • Figure 2: Overview of SimToolReal. (Top) Training in Simulation: We train a goal-conditioned RL policy in simulation that manipulates a wide variety of procedurally-generated objects to randomly sampled goal poses. (Bottom) Inference in Real: We deploy this policy zero-shot on real-world tools from DexToolBench, following tool trajectories from human videos.
  • Figure 3: Real-World Deployment. (Left) Human Video Processing: We collect an RGB-D human video and process it using vision foundation models. We use SAM 3D [chen2025sam] to generate a metric-scale object mesh and segment a 3D grasp bounding box. Then, we use FoundationPose [wen2024foundationpose] to extract a sequence of 6D goal poses. (Right) Inference-Time Pipeline: Our LSTM policy takes in proprioception, object pose, grasp bounding box, and goal pose, and it outputs joint position targets for the 29-DoF dexterous robot (arm + hand).
  • Figure 4: Generalization to Unseen DexToolBench Tools and Tasks in the Real World. We evaluate our policy in the real world on unseen tool-use tasks in DexToolBench. Our evaluations span 24 unique task trajectories across 6 different object categories and 12 object instances. Each bar corresponds to 1 task trajectory on 1 object instance. We report the average Task Progress across 5 rollouts. Despite not being trained on these objects or trajectories, our policy demonstrates strong generalization to diverse tools of varying masses and geometries.
  • Figure 5: Comparison against Baselines in the Real World. We compare SimToolReal against baselines on two variations of sweeping a table with a brush: with and without requiring tool rotation, depending on the initial states shown on the left. Average Task Progress is indicated in parentheses. SimToolReal succeeds on both variations, performing dexterous in-hand tool rotations in the harder variation. Fixed Grasp succeeds on the simpler variation of this task without tool rotation. However, when rotation is required, enforcing a fixed grasp causes the arm to collide with the table while tracking the target trajectory. Kinematic Retargeting fails to reason about contact forces and cannot grasp the brush in either variation.
  • ...and 11 more figures
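
The evaluation protocol in Figure 4 (5 rollouts for each of 24 task trajectories, with Task Progress averaged per trajectory) can be tallied with a short sketch. It assumes Task Progress is a per-rollout fraction in [0, 1], which this section does not define; the function names and the example values in the usage note are hypothetical placeholders, not reported results.

```python
from statistics import mean

NUM_TASK_TRAJECTORIES = 24  # from the Figure 4 caption
ROLLOUTS_PER_TASK = 5       # from the Figure 4 caption


def total_rollouts(num_tasks=NUM_TASK_TRAJECTORIES, rollouts_per_task=ROLLOUTS_PER_TASK):
    """Total number of real-world rollouts in the evaluation."""
    return num_tasks * rollouts_per_task


def average_task_progress(progress_per_rollout):
    """Mean Task Progress over the rollouts of one task trajectory.

    Assumes each entry is a fraction in [0, 1] (an assumption, see lead-in).
    """
    if not progress_per_rollout:
        raise ValueError("need at least one rollout")
    for p in progress_per_rollout:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"progress {p} is outside [0, 1]")
    return mean(progress_per_rollout)


if __name__ == "__main__":
    print(total_rollouts())
```

With the caption's numbers, `total_rollouts()` gives 24 × 5 = 120, matching the 120 real-world rollouts stated in the abstract. A call such as `average_task_progress([1.0, 0.5, 0.5, 1.0, 0.0])` (made-up values) would yield the height of one bar in a plot like Figure 4.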