SparseDFF: Sparse-View Feature Distillation for One-Shot Dexterous Manipulation

Qianxu Wang, Haotong Zhang, Congyue Deng, Yang You, Hao Dong, Yixin Zhu, Leonidas Guibas

TL;DR

SparseDFF presents a one-shot approach to dexterous manipulation that distills view-consistent 3D feature fields from sparse RGBD observations using large 2D vision models. A lightweight feature refinement network, trained on a single demonstration, enforces multi-view consistency via contrastive learning; a point-pruning step then improves local feature continuity. The end-effector pose is transferred from the demonstration to the target scene by minimizing the feature differences between the two, with explicit penalties that keep the result physically feasible. Real-world experiments with a 24-DOF dexterous hand demonstrate strong generalization to rigid and deformable objects across diverse scenes and poses, outperforming baselines and enabling interactions beyond grasping. The work highlights the potential of semantic 3D feature fields, distilled from 2D models, for rapid, generalizable manipulation in fixed-camera, sparse-view settings.
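The pose-transfer step described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the feature field is approximated by inverse-distance-weighted interpolation, the energy is a sum of squared feature differences, the pose is reduced to a 3D translation, and the physical-feasibility penalties are omitted. The names `interpolate_feature`, `energy`, and `optimize_offset` are hypothetical.

```python
import math

def interpolate_feature(query, points, feats, eps=1e-6):
    """Inverse-distance-weighted feature at `query` (a stand-in for the learned field)."""
    weights = [1.0 / (math.dist(query, p) ** 2 + eps) for p in points]
    total = sum(weights)
    return [sum(w * f[d] for w, f in zip(weights, feats)) / total
            for d in range(len(feats[0]))]

def energy(offset, queries, demo_feats, points, feats):
    """Sum of squared feature differences after shifting the query points by `offset`."""
    total = 0.0
    for q, f_demo in zip(queries, demo_feats):
        moved = [qi + oi for qi, oi in zip(q, offset)]
        f_target = interpolate_feature(moved, points, feats)
        total += sum((a - b) ** 2 for a, b in zip(f_target, f_demo))
    return total

def optimize_offset(queries, demo_feats, points, feats, steps=200, lr=0.05, h=1e-4):
    """Gradient descent on a translation-only pose, using central finite differences."""
    offset = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = []
        for k in range(3):
            up, down = list(offset), list(offset)
            up[k] += h
            down[k] -= h
            e_up = energy(up, queries, demo_feats, points, feats)
            e_down = energy(down, queries, demo_feats, points, feats)
            grad.append((e_up - e_down) / (2 * h))
        offset = [o - lr * g for o, g in zip(offset, grad)]
    return offset

# Toy scene (assumed): the target object is the demo object shifted by 0.3 along x.
demo_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
point_feats = [[1.0, 0.0], [0.0, 1.0]]
target_points = [(x + 0.3, y, z) for x, y, z in demo_points]

# Hypothetical query points sampled on the end-effector in the demonstration.
queries = [(0.2, 0.0, 0.0), (0.8, 0.0, 0.0)]
demo_feats = [interpolate_feature(q, demo_points, point_feats) for q in queries]

start_energy = energy([0.0, 0.0, 0.0], queries, demo_feats, target_points, point_feats)
offset = optimize_offset(queries, demo_feats, target_points, point_feats)
final_energy = energy(offset, queries, demo_feats, target_points, point_feats)
print(offset, start_energy, final_energy)
```

In this toy setup the recovered offset approaches the shift applied to the object, since that is where the query-point features match the demonstration again. A full version would optimize the hand's pose and joint angles rather than a translation alone.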

Abstract

Humans demonstrate remarkable skill in transferring manipulation abilities across objects of varying shapes, poses, and appearances, a capability rooted in their understanding of semantic correspondences between different instances. To equip robots with a similar high-level comprehension, we present SparseDFF, a novel Distilled Feature Field (DFF) for 3D scenes that uses large 2D vision models to extract semantic features from sparse RGBD images, a domain where research is limited despite its relevance to many tasks with fixed-camera setups. SparseDFF generates view-consistent 3D DFFs, enabling efficient one-shot learning of dexterous manipulations by mapping image features to a 3D point cloud. Central to SparseDFF is a feature refinement network, optimized with a contrastive loss between views, together with a point-pruning mechanism for feature continuity. This facilitates the minimization of feature discrepancies with respect to the end-effector parameters, bridging demonstrations and target manipulations. Validated in real-world scenarios with a dexterous hand, SparseDFF proves effective in manipulating both rigid and deformable objects, demonstrating significant generalization capabilities across object and scene variations.
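To make the idea of a contrastive loss between views concrete, here is a small sketch. The abstract does not give the exact loss, so the InfoNCE-style form, the cosine similarity, and the `temperature` value below are assumptions. The inputs `feats_a[i]` and `feats_b[i]` stand for the features of the same 3D point as seen from two different cameras.

```python
import math

def cosine(u, v, eps=1e-12):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)

def info_nce(feats_a, feats_b, temperature=0.1):
    """InfoNCE-style loss: point i in view A should match point i in view B
    and differ from every other point in view B."""
    loss = 0.0
    for i, fa in enumerate(feats_a):
        logits = [cosine(fa, fb) / temperature for fb in feats_b]
        # Numerically stable log-sum-exp over all candidates in view B.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(feats_a)

# Two 3D points observed from two views (toy features).
view_a = [[1.0, 0.0], [0.0, 1.0]]
consistent_b = [[1.0, 0.0], [0.0, 1.0]]    # same point, same feature
inconsistent_b = [[0.0, 1.0], [1.0, 0.0]]  # features swapped across views

loss_consistent = info_nce(view_a, consistent_b)
loss_inconsistent = info_nce(view_a, inconsistent_b)
print(loss_consistent, loss_inconsistent)
```

The loss is small when corresponding points carry matching features in both views and large when they do not, which is the pressure a refinement network would be trained under.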

Paper Structure

This paper contains 21 sections, 6 equations, 7 figures, and 2 tables.

Figures (7)

  • Figure 1: Constructing the sparse-view DFF. (a) We first aggregate DINO features to form an initial 3D DFF. (b) A lightweight network, trained solely on a single demonstration with a contrastive loss, then refines these features to improve field consistency. (c) Finally, a pruning algorithm scores each point by the feature similarity of the points in its vicinity; points that receive too few votes are eliminated.
  • Figure 2: End-effector optimization. (a) We sample query points on the end-effector and compute their features using the learned 3D feature field. Minimizing the feature differences as an energy function facilitates the transfer of the end-effector pose from the source demonstration to the target manipulation. (b) The color gradient on the hand indicates the optimization steps from start to end.
  • Figure 3: Qualitative results on rigid-object grasping. Each panel illustrates the initial grasping pose, determined via our end-effector optimization, followed by a frame capturing the successful lift-off of the target object. (a) Grasping Box1 and transferring the skill to Boxes in new poses, including a distinct box, Box2. (b) A functional grasp of a drill by its handle. (c) Transferring the learned grasp on Bowl1 to bowls with varied shapes (top row) and cross-category generalization to Mugs (bottom row).
  • Figure 4: Qualitative results on deformable-object grasping. For each successful grasp, we show the initial grasping pose and a frame demonstrating the successful lift of the object off the table. (a) Learning to grasp SmallBear and transferring this skill to various poses and to the Monkey. (b) Learning to grasp BigBear by the nose, which is challenging because the nose is small. (c) Learning to grasp the Monkey, showcasing adaptability to significant deformations and transfer to SmallBear. Additionally, a challenging scenario is presented in which the Monkey is surrounded by multiple objects, showing the capability to handle interactions and occlusions.
  • Figure 5: Pet toy animals. (a) Head caressing is transferred from a single, lying Monkey to a scene with the Monkey hugging the BigBear, exemplifying the method's adaptability to varying scene compositions and interactions. (b) Butt patting is transferred from the Monkey to the SmallBear, whether the SmallBear is alone or in different scene contexts, underlining the method's versatility across various scenarios and object interactions.
  • ...and 2 more figures
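
The pruning step in Figure 1(c) can be sketched as a neighborhood vote. The caption only says that points are assessed by feature similarity in their vicinity and that points with too few votes are removed, so the specifics here are assumptions: the `radius`, the cosine `sim_threshold`, the `min_votes` cutoff, and the brute-force neighbor search in the hypothetical `prune_points`.

```python
import math

def cosine(u, v, eps=1e-12):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)

def prune_points(points, feats, radius=0.1, sim_threshold=0.8, min_votes=1):
    """Return the indices of points whose features agree with enough spatial neighbors."""
    keep = []
    for i, (p, f) in enumerate(zip(points, feats)):
        votes = 0
        for j, (q, g) in enumerate(zip(points, feats)):
            if i == j:
                continue
            # A neighbor votes for point i if it is close and has a similar feature.
            if math.dist(p, q) <= radius and cosine(f, g) >= sim_threshold:
                votes += 1
        if votes >= min_votes:
            keep.append(i)
    return keep

# Four nearby points; the last one carries a feature unlike its neighbors.
points = [(0.0, 0.0, 0.0), (0.05, 0.0, 0.0), (0.0, 0.05, 0.0), (0.05, 0.05, 0.0)]
feats = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0], [0.0, 1.0]]

kept = prune_points(points, feats)
print(kept)  # [0, 1, 2]
```

The quadratic neighbor search keeps the sketch short; on a real point cloud a spatial index such as a k-d tree would be the usual choice.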