LensDFF: Language-enhanced Sparse Feature Distillation for Efficient Few-Shot Dexterous Manipulation

Qian Feng; David S. Martinez Lema; Jianxiang Feng; Zhaopeng Chen; Alois Knoll

TL;DR

This work tackles few-shot dexterous manipulation by eliminating heavy per-scene training and NeRF-style rendering. It introduces LensDFF, a language-enhanced sparse feature distillation method that aligns 2D vision features from sparse views with language features to produce coherent 3D feature representations for grasp optimization. By integrating grasp primitives and a real2sim evaluation loop, LensDFF demonstrates robust, dexterous grasping on unseen objects from a single view, with strong real-world performance and competitive simulation results. The approach offers a practical, scalable pathway to language-guided manipulation in real-world robotics, reducing data collection and computation while maintaining high dexterity.

Abstract

Learning dexterous manipulation from few-shot demonstrations is a significant yet challenging problem for advanced, human-like robotic systems. Dense distilled feature fields have addressed this challenge by distilling rich semantic features from 2D visual foundation models into the 3D domain. However, their reliance on neural rendering models such as Neural Radiance Fields (NeRF) or Gaussian Splatting results in high computational costs. In contrast, previous approaches based on sparse feature fields either suffer from inefficiencies due to multi-view dependencies and extensive training or lack sufficient grasp dexterity. To overcome these limitations, we propose Language-ENhanced Sparse Distilled Feature Field (LensDFF), which efficiently distills view-consistent 2D features onto 3D points using our novel language-enhanced feature fusion strategy, thereby enabling single-view few-shot generalization. Based on LensDFF, we further introduce a few-shot dexterous manipulation framework that integrates grasp primitives into the demonstrations to generate stable and highly dexterous grasps. Moreover, we present a real2sim grasp evaluation pipeline for efficient grasp assessment and hyperparameter tuning. Through extensive simulation experiments based on the real2sim pipeline and real-world experiments, our approach achieves competitive grasping performance, outperforming state-of-the-art approaches.
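
To make the core idea of sparse feature distillation concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single calibrated pinhole camera with intrinsics K, a per-pixel 2D feature map from some vision foundation model, and a text embedding that lives in the same feature space. The function names (project_points, distill_features, language_fuse) and the similarity-weighted blending rule are hypothetical choices made for illustration; the abstract does not specify the actual fusion strategy.

```python
import numpy as np

def project_points(points_cam, K):
    """Pinhole projection of (N, 3) camera-frame points to (N, 2) pixel coordinates."""
    uvw = (K @ points_cam.T).T           # (N, 3) homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]      # divide by depth

def distill_features(points_cam, feat_map, K):
    """Give each 3D point the 2D feature at its nearest projected pixel.

    feat_map has shape (H, W, C); points outside the image are clamped to the border.
    """
    h, w, _ = feat_map.shape
    uv = np.rint(project_points(points_cam, K)).astype(int)
    u = np.clip(uv[:, 0], 0, w - 1)
    v = np.clip(uv[:, 1], 0, h - 1)
    return feat_map[v, u]                # (N, C)

def language_fuse(point_feats, text_feat, eps=1e-8):
    """Hypothetical fusion rule (an assumption, not the paper's method).

    Each unit-normalized point feature is blended with the unit-normalized text
    embedding, weighted by their cosine similarity, then re-normalized.
    """
    f = point_feats / (np.linalg.norm(point_feats, axis=1, keepdims=True) + eps)
    t = text_feat / (np.linalg.norm(text_feat) + eps)
    sim = f @ t                          # (N,) cosine similarity per point
    fused = f + sim[:, None] * t
    return fused / (np.linalg.norm(fused, axis=1, keepdims=True) + eps)
```

Under these assumptions, a point cloud from one depth view can be passed through distill_features and then language_fuse to obtain one unit-length feature vector per 3D point, with no per-scene training or neural rendering step.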
