Visual Imitation Learning of Task-Oriented Object Grasping and Rearrangement
Yichen Cai, Jianfeng Gao, Christoph Pohl, Tamim Asfour
TL;DR
This work addresses task-oriented manipulation under partial object views and large shape variation. The authors introduce MIMO, a multi-feature implicit neural field with four spatial branches ${\Phi}_{occ},{\Phi}_{sdf},{\Phi}_{escf},{\Phi}_{cdd}$ that yields a descriptor $z=\kappa(\mathbf{x}|\mathbf{P})$, together with a pose descriptor ${}^A\mathbf{Z}_B = \varphi(\mathbf{T},\mathbf{X}|\mathbf{P}^A_r)$ for cross-object transfer. Training uses a multi-task loss ${\mathcal{L}} = \sum_{i=1}^{4} (e^{-s_i} {\mathcal{L}}_i + s_i)$ with $s_i = \log(\sigma_i^2)$. A task-oriented grasping framework uses human demonstrations to select or transfer grasps, fits a GMM on the manifold $\mathbb{R}^3 \times \mathcal{S}^3$, and employs a grasp evaluation network to refine candidates. The approach improves shape reconstruction and dense correspondences, outperforms the NDF and NIFT baselines, and enables robust one- and few-shot imitation in both simulation and real-world experiments. These results demonstrate practical viability for transferring manipulation skills to unseen objects and support real-time, data-efficient learning of task-oriented grasps and rearrangements.
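To make the multi-task loss concrete, the following is a minimal sketch of how the weighted sum $\sum_{i=1}^{4} (e^{-s_i} {\mathcal{L}}_i + s_i)$ could be evaluated. It is not the authors' implementation: the function name `uncertainty_weighted_loss` and the numeric branch losses are hypothetical, and in actual training the $s_i$ would presumably be learnable parameters optimized jointly with the network rather than fixed numbers.

```python
import math


def uncertainty_weighted_loss(task_losses, log_vars):
    """Return sum_i (exp(-s_i) * L_i + s_i), where s_i = log(sigma_i^2).

    task_losses: per-task loss values L_i (here, one per output branch).
    log_vars:    per-task log-variances s_i (assumed learnable in practice).
    """
    if len(task_losses) != len(log_vars):
        raise ValueError("need exactly one log-variance per task loss")
    return sum(math.exp(-s) * loss + s for loss, s in zip(task_losses, log_vars))


# Hypothetical loss values for the occ, sdf, escf, and cdd branches
# (illustrative numbers only, not taken from the paper).
branch_losses = [0.5, 0.2, 0.1, 0.4]

# s_i = 0 corresponds to sigma_i = 1, so the total reduces to the plain sum.
log_vars = [0.0, 0.0, 0.0, 0.0]

total = uncertainty_weighted_loss(branch_losses, log_vars)
print(f"total loss: {total:.4f}")
```

Each term trades off two effects: a larger $s_i$ shrinks the weight $e^{-s_i}$ on a noisy task, while the additive $+ s_i$ penalizes letting every $s_i$ grow without bound.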
Abstract
Task-oriented object grasping and rearrangement are critical skills for robots to accomplish different real-world manipulation tasks. However, they remain challenging due to partial observations of the objects and shape variations within object categories. In this paper, we propose the Multi-feature Implicit Model (MIMO), a novel object representation that encodes multiple spatial features between a point and an object in an implicit neural field. Training such a model on multiple features ensures that it embeds object shapes consistently across different aspects, thus improving its performance in object shape reconstruction from partial observations, shape similarity measurement, and modeling of spatial relations between objects. Based on MIMO, we propose a framework to learn task-oriented object grasping and rearrangement from single or multiple human demonstration videos. Evaluations in simulation show that our approach outperforms state-of-the-art methods for both multi- and single-view observations. Real-world experiments demonstrate the efficacy of our approach in one- and few-shot imitation learning of manipulation tasks.
