SparseDFF: Sparse-View Feature Distillation for One-Shot Dexterous Manipulation
Qianxu Wang, Haotong Zhang, Congyue Deng, Yang You, Hao Dong, Yixin Zhu, Leonidas Guibas
TL;DR
SparseDFF is a one-shot approach to dexterous manipulation that distills view-consistent 3D feature fields from sparse RGBD observations using large 2D vision models. A lightweight feature refinement network, trained on a single demonstration, enforces multi-view consistency via contrastive learning; a point-pruning step then improves local feature continuity. The demonstrated end-effector pose is transferred to a new scene by minimizing feature differences between the demonstration and the target, with explicit penalties that keep the result physically feasible. Real-world experiments with a 24-DOF dexterous hand demonstrate strong generalization to rigid and deformable objects across diverse scenes and poses, outperforming baselines and enabling interactions beyond grasping. The work highlights the potential of semantic 3D feature fields, distilled from 2D models, for rapid, generalizable manipulation in fixed-camera, sparse-view settings.
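The multi-view consistency objective mentioned above can be illustrated with a minimal sketch. It assumes an InfoNCE-style contrastive loss in which row i of each array holds the feature of the same 3D point as seen from two different cameras; the function name `info_nce_loss`, the temperature value, and the choice of InfoNCE itself are illustrative assumptions, not details confirmed by this summary.

```python
import numpy as np

def info_nce_loss(feats_a, feats_b, temperature=0.07):
    """Hypothetical contrastive loss between two views.

    feats_a[i] and feats_b[i] are assumed to describe the same 3D point
    observed from two cameras (a positive pair); every other pairing in
    the batch is treated as a negative.
    """
    # Normalize so that dot products are cosine similarities.
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                      # shape (N, N)
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal.
    return -np.mean(np.diag(log_prob))

# Toy check: view-consistent features score better than mismatched ones.
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))
loss_consistent = info_nce_loss(feats, feats)
loss_mismatched = info_nce_loss(feats, np.roll(feats, 1, axis=0))
```

In this toy setup the loss is small when both views agree on each point's feature and grows when the correspondence is broken, which is the behavior a refinement network would be trained to exploit.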
Abstract
Humans demonstrate remarkable skill in transferring manipulation abilities across objects of varying shapes, poses, and appearances, a capability rooted in their understanding of semantic correspondences between different instances. To equip robots with a similar high-level comprehension, we present SparseDFF, a novel Distilled Feature Field (DFF) for 3D scenes that uses large 2D vision models to extract semantic features from sparse RGBD images, a domain where research is limited despite its relevance to many tasks with fixed-camera setups. SparseDFF generates view-consistent 3D DFFs, enabling efficient one-shot learning of dexterous manipulations by mapping image features to a 3D point cloud. Central to SparseDFF is a feature refinement network, optimized with a contrastive loss between views, and a point-pruning mechanism for feature continuity. Together, these make it possible to minimize feature discrepancies with respect to end-effector parameters, bridging demonstrations and target manipulations. Validated in real-world scenarios with a dexterous hand, SparseDFF proves effective in manipulating both rigid and deformable objects, demonstrating significant generalization capabilities across object and scene variations.
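To make the idea of minimizing feature discrepancies with respect to end-effector parameters more concrete, here is a simplified sketch. It assumes the feature field is evaluated at query points attached to the hand by inverse-distance weighting over the k nearest cloud points, and that the end-effector parameters reduce to a single 3D translation. The helper names `field_features` and `feature_energy`, the interpolation scheme, and the L1 energy are all illustrative assumptions; a full system would optimize the complete hand pose and joint angles and add physical-feasibility penalties.

```python
import numpy as np

def field_features(queries, points, feats, k=4, eps=1e-8):
    """Evaluate a point-based feature field at query locations
    (assumed scheme: inverse-distance weighting over k nearest points)."""
    dists = np.linalg.norm(queries[:, None, :] - points[None, :, :], axis=2)
    idx = np.argsort(dists, axis=1)[:, :k]              # (Q, k) neighbor ids
    near = np.take_along_axis(dists, idx, axis=1)       # (Q, k) distances
    weights = 1.0 / (near + eps)
    weights /= weights.sum(axis=1, keepdims=True)
    return (weights[:, :, None] * feats[idx]).sum(axis=1)

def feature_energy(translation, hand_queries, demo_feats, points, feats):
    """L1 gap between features sampled at the moved hand queries in the
    target scene and the features recorded in the demonstration."""
    moved = hand_queries + translation
    return np.abs(field_features(moved, points, feats) - demo_feats).sum()

# Toy scene: the target is the demonstration object shifted by `offset`.
rng = np.random.default_rng(0)
demo_points = rng.uniform(-1.0, 1.0, size=(200, 3))
point_feats = rng.normal(size=(200, 8))
hand_queries = rng.uniform(-0.5, 0.5, size=(10, 3))
demo_feats = field_features(hand_queries, demo_points, point_feats)

offset = np.array([0.3, -0.2, 0.1])
target_points = demo_points + offset

energy_at_offset = feature_energy(offset, hand_queries, demo_feats,
                                  target_points, point_feats)
energy_at_zero = feature_energy(np.zeros(3), hand_queries, demo_feats,
                                target_points, point_feats)
```

In this toy case the energy is essentially zero when the hand is moved by the same offset as the object and clearly larger when it is left in place, so an optimizer that descends this energy would be pulled toward the pose that reproduces the demonstrated contact features.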
