LensDFF: Language-enhanced Sparse Feature Distillation for Efficient Few-Shot Dexterous Manipulation

Qian Feng, David S. Martinez Lema, Jianxiang Feng, Zhaopeng Chen, Alois Knoll

TL;DR

This work tackles few-shot dexterous manipulation by eliminating heavy per-scene training and NeRF-style rendering. It introduces LensDFF, a language-enhanced sparse feature distillation method that aligns 2D vision features from sparse views with language features to produce coherent 3D feature representations for grasp optimization. By integrating grasp primitives and a real2sim evaluation loop, LensDFF demonstrates robust, dexterous grasping on unseen objects from a single view, with strong real-world performance and competitive simulation results. The approach offers a practical, scalable pathway to language-guided manipulation in real-world robotics, reducing data collection and computation while maintaining high dexterity.

Abstract

Learning dexterous manipulation from few-shot demonstrations is a significant yet challenging problem for advanced, human-like robotic systems. Dense distilled feature fields have addressed this challenge by distilling rich semantic features from 2D visual foundation models into the 3D domain. However, their reliance on neural rendering models such as Neural Radiance Fields (NeRF) or Gaussian Splatting results in high computational costs. In contrast, previous approaches based on sparse feature fields either suffer from inefficiencies due to multi-view dependencies and extensive training or lack sufficient grasp dexterity. To overcome these limitations, we propose Language-ENhanced Sparse Distilled Feature Field (LensDFF), which efficiently distills view-consistent 2D features onto 3D points using our novel language-enhanced feature fusion strategy, thereby enabling single-view few-shot generalization. Based on LensDFF, we further introduce a few-shot dexterous manipulation framework that integrates grasp primitives into the demonstrations to generate stable and highly dexterous grasps. Moreover, we present a real2sim grasp evaluation pipeline for efficient grasp assessment and hyperparameter tuning. Through extensive simulation experiments based on the real2sim pipeline and real-world experiments, our approach achieves competitive grasping performance, outperforming state-of-the-art approaches.

Paper Structure

This paper contains 26 sections, 6 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: LensDFF demo data pipeline. Given a user prompt that includes the object name and grasp primitive, the closest demo is retrieved, and its demo prompt features $\mathbf{f}_{\text{lan}}^{\text{demo}}$ are compared with the test prompt features $\mathbf{f}_{\text{lan}}^{\text{test}}$ for test-time language feature alignment. The resulting language feature is then used for language feature enhancement, which aligns the vision features $\mathbf{f}_{\text{vis}}$ from multiple demo viewpoints to generate consistent distilled 3D features.
  • Figure 2: LensDFF test data pipeline. Our approach applies SAM2 [ravi2024sam2] to a single RGB image to detect the target object. A second view is selected if the object is not visible. The same test-time language feature alignment and language feature enhancement as in the demo data pipeline are applied. The main difference is that only vision features from one view are projected. Finally, the 3D distilled features from both the demo and test data are utilized for grasp optimization.
  • Figure 3: Demo Grasps with Diverse Grasp Primitives. This figure illustrates the versatility of our collected demos using different grasp primitives across a range of objects. (a) Pinch grasp: The robot delicately pinches the teddy bear's ear between the thumb and index finger, demonstrating precision and control for handling small or delicate objects. (b) Hook grasp: The robot secures the handle of a dustpan using a hook grasp, forming hooks with its fingers to ensure a firm grip for lifting or carrying. (c) Tripod grasp: The Mentos gum package is grasped with a tripod grasp, where the thumb and two fingers provide stability and dexterity for precise manipulation. (d) Cylindrical grasp: The robot wraps its fingers around the white mug, forming a cylindrical grasp that ensures stability and force closure for larger objects. (e) Lumbrical grasp: The robot adopts a lumbrical grasp to hold the crackers box, with the fingers positioned parallel to the object's surface, offering a secure grip for flat or boxy objects.
  • Figure 4: Visualization of the palm pose sampler. (c) is an example where the poses are sampled from a partial view of a drill.
  • Figure 5: Real-World Experimental Setup and Objects. (a) Robot setup for the real-world experiments. (b) 10 daily objects used for demo collection. (c) 12 YCB test objects [calli2015ycb].
  • ...and 3 more figures
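
The demo retrieval step described in the Figure 1 caption (comparing $\mathbf{f}_{\text{lan}}^{\text{test}}$ against each $\mathbf{f}_{\text{lan}}^{\text{demo}}$ and picking the closest demo) can be illustrated with a minimal sketch. This is not the authors' implementation: the excerpt does not state the similarity measure or the language encoder, so cosine similarity, the function names `cosine_similarity` and `retrieve_closest_demo`, and the three-dimensional toy feature vectors are all assumptions made for illustration.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("feature vectors must be non-zero")
    return dot / (norm_a * norm_b)


def retrieve_closest_demo(
    f_lan_test: Sequence[float],
    f_lan_demos: Dict[str, Sequence[float]],
) -> Tuple[str, float]:
    """Return the demo prompt whose language feature is most similar to the test one.

    Assumption: "closest" means highest cosine similarity; the paper may use
    a different criterion.
    """
    if not f_lan_demos:
        raise ValueError("at least one demo is required")
    scores = {
        prompt: cosine_similarity(f_lan_test, feature)
        for prompt, feature in f_lan_demos.items()
    }
    best_prompt = max(scores, key=scores.get)
    return best_prompt, scores[best_prompt]


# Hypothetical toy features; real prompt features would come from a language encoder.
demo_features = {
    "mug, cylindrical grasp": [0.9, 0.1, 0.0],
    "teddy bear, pinch grasp": [0.1, 0.9, 0.2],
}
test_feature = [0.8, 0.2, 0.1]

closest_prompt, score = retrieve_closest_demo(test_feature, demo_features)
print(closest_prompt, round(score, 3))
```

With these toy vectors the test prompt feature lies closer to the mug demo than to the teddy bear demo, so the mug demo is the one retrieved. In a full pipeline the retrieved demo's language feature would then feed the language feature enhancement step, which this sketch does not cover.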