Visual Imitation Learning of Task-Oriented Object Grasping and Rearrangement

Yichen Cai; Jianfeng Gao; Christoph Pohl; Tamim Asfour

TL;DR

This paper addresses task-oriented manipulation under partial object views and large shape variation within object categories. The authors introduce MIMO, a multi-feature implicit neural field with four output branches, $\mathbf{\Phi}_{\text{occ}}, \mathbf{\Phi}_{\text{sdf}}, \mathbf{\Phi}_{\text{escf}}, \mathbf{\Phi}_{\text{cdd}}$, that yields a point descriptor $\mathbf{z}=\kappa(\mathbf{x}|\mathbf{P})$ and a pose descriptor ${}^A\mathbf{Z}_B = \varphi(\mathbf{T},\mathbf{X}|\mathbf{P}^A_r)$ for cross-object transfer. Training uses the multi-task loss ${\mathcal{L}} = \sum_{i=1}^{4} (e^{-s_i} {\mathcal{L}}_i + s_i)$ with $s_i = \log(\sigma_i^2)$. A task-oriented grasping framework uses human demonstrations to select or transfer grasps, trains a GMM on the manifold $\mathbb{R}^3 \times \mathcal{S}^3$, and employs a grasp evaluation network to refine candidates. The approach improves shape reconstruction and dense correspondences, enabling one- and few-shot imitation in both simulation and real-world experiments and outperforming the NDF and NIFT baselines. These results indicate that manipulation skills can be transferred to unseen objects and support real-time, data-efficient learning of task-oriented grasps and rearrangements.
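The multi-task loss quoted above can be illustrated with a minimal sketch. It only evaluates the formula $\sum_i (e^{-s_i}\mathcal{L}_i + s_i)$ for given numbers; it is not the authors' training code. The function name `multi_task_loss` and the example loss values are hypothetical, and the log-variances $s_i$ are plain floats here, whereas in a training setup they would presumably be optimized together with the network weights.

```python
import math


def multi_task_loss(task_losses, log_variances):
    """Combine per-task losses as sum_i (exp(-s_i) * L_i + s_i).

    task_losses   -- per-task loss values L_i (hypothetical numbers here)
    log_variances -- s_i = log(sigma_i^2), one per task
    """
    if len(task_losses) != len(log_variances):
        raise ValueError("need exactly one log-variance per task loss")
    return sum(
        math.exp(-s) * loss + s
        for loss, s in zip(task_losses, log_variances)
    )


# Hypothetical losses for the four branches (occ, sdf, escf, cdd).
branch_losses = [0.5, 0.2, 0.1, 0.4]

# With s_i = 0 (sigma_i^2 = 1) the total reduces to the plain sum of losses.
unweighted_total = multi_task_loss(branch_losses, [0.0, 0.0, 0.0, 0.0])

# A larger s_i down-weights a task's loss but adds s_i as a penalty term.
single_task_total = multi_task_loss([2.0], [math.log(4.0)])
```

The additive $s_i$ term keeps the weighting from collapsing: without it, the loss could be driven toward zero simply by making every $s_i$ large.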

Abstract

Task-oriented object grasping and rearrangement are critical skills for robots to accomplish different real-world manipulation tasks. However, they remain challenging due to partial observations of the objects and shape variations in categorical objects. In this paper, we propose the Multi-feature Implicit Model (MIMO), a novel object representation that encodes multiple spatial features between a point and an object in an implicit neural field. Training such a model on multiple features ensures that it embeds the object shapes consistently in different aspects, thus improving its performance in object shape reconstruction from partial observation, shape similarity measure, and modeling spatial relations between objects. Based on MIMO, we propose a framework to learn task-oriented object grasping and rearrangement from single or multiple human demonstration videos. The evaluations in simulation show that our approach outperforms the state-of-the-art methods for multi- and single-view observations. Real-world experiments demonstrate the efficacy of our approach in one- and few-shot imitation learning of manipulation tasks.


Paper Structure (25 sections, 7 figures, 3 tables)

Introduction
Related Work
Neural Fields and Neural Descriptors
Modeling Task Relevance
Category-Level Manipulation
MIMO for Manipulation
Multi-feature Implicit Model
Multi-task Loss Function
Pose Descriptor
Pose Transfer
MIMO-based Grasp Framework
Human Observation
Task-oriented Grasp Learning
Grasp Evaluation
Inference
...and 10 more sections

Figures (7)

Figure 1: Learning task-oriented object grasping and rearrangement from human demonstration videos of manipulation tasks. We illustrate two tasks: side picking a mug and pouring into a bowl; and top-down picking a mug and placing it into a container. For each task, we show the RGB image, the observed point clouds ( $\textcolor{black}{\bullet}$ ), reconstructed object meshes, extracted hand mesh, grasp poses ($\mathbf{T}_g^d$), and the execution on a humanoid robot.
Figure 2: Multi-feature Implicit Model (MIMO) and its applications. MIMO takes as input an object point cloud $\mathbf{P}$ and a point coordinate $\mathbf{x}$ and outputs multiple spatial features of $\mathbf{x}$ relative to $\mathbf{P}$, including occupancy $\mathbf{\Phi}_{\text{occ}}$, signed distance $\mathbf{\Phi}_{\text{sdf}}$, extended space coverage feature (ESCF) $\mathbf{\Phi}_{\text{escf}}$ and closest distance direction (CDD) $\mathbf{\Phi}_{\text{cdd}}$. The concatenation of activation layers of the decoder for $\mathbf{\Phi}_{\text{escf}}$ and $\mathbf{\Phi}_{\text{cdd}}$ forms the point descriptor $\mathbf{z}$ of $\mathbf{x}$. The CDD is represented as the inner product of two unit vectors $\mathbf{v}_{p}$ and $\mathbf{v}_{d}$. The high-dimensional point descriptors of each reference object are reduced to a 3D space using Principal Component Analysis (PCA), with the three components mapped to the RGB channels of the color map. Each point of other categorical object instances is colorized according to the most similar point (smallest L1 distance in point descriptors) from the corresponding reference object. MIMO can be used for object shape reconstruction and grasp pose transfer.
Figure 3: Point correspondence and shape similarity measure using point descriptors from partially-observed point clouds ( $\textcolor{black}{\bullet}$ ). Given a point on a reference object, we colorize the novel object mesh based on the L1 distance of point descriptors to the reference point, where blue means more similar, and mark the most similar points ( $\textcolor{green}{\bullet}$ ).
Figure 4: Proposed MIMO-based Grasp Framework. (a) Given a human demonstration of a grasping scene, we obtain the object point cloud $\mathbf{P}^{d}$ and a grasp pose $\mathbf{T}_{g}^d$. We generate task-agnostic grasp poses $\{\mathbf{T}_g^{a}\}$ using a grasp generator [sundermeyer2021contact], and use MIMO as a discriminator to select the task-relevant candidates $\{\mathbf{T}_g^{r}\}$ based on pose descriptor similarities between $\mathbf{T}_g^{d}$ and $\mathbf{T}_g^{a}$. Alternatively, we can directly transfer the demonstrated grasp pose $\mathbf{T}_g^d$ to the canonical point cloud $\mathbf{P}^c$ using MIMO. We then simulate the candidates $\{\mathbf{T}_g^{r}\}$ to find the successful grasp poses $\{\bar{\mathbf{T}}_g^{r}\}$ to train a GMM. (b) Given an object point cloud $\mathbf{P}$, a grasp pose $\mathbf{T}_g$ and a set of hand keypoints $\mathbf{P}^k$, the grasp evaluation network encodes $\mathbf{P}$ using the frozen encoder $\epsilon(\cdot)$ of MIMO and outputs the grasp success probability using an MLP. (c) During inference, the sampled grasp pose $\hat{\mathbf{T}}_g$ relative to the canonical point cloud $\mathbf{P}^c$ is transferred to a partially-observed point cloud $\mathbf{P}^o$ using MIMO, and the transferred grasp pose $\tilde{\mathbf{T}}_g$ is evaluated and refined (if necessary) to obtain the optimal grasp pose $\mathbf{T}_g^*$.
Figure 5: Success rate of the pick-and-place tasks \ref{item:t1}-\ref{item:t3} with unseen objects under setting \ref{item:demo10} for models NDF, NIFT, MIMO3, and MIMO4, respectively.
...and 2 more figures
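
The captions of Figures 2 and 3 describe matching points across objects by the smallest L1 distance between point descriptors. The following minimal sketch shows only that matching step. It assumes the descriptors have already been computed by some model; the tiny 3-dimensional vectors and the function names `l1_distance` and `most_similar_points` are hypothetical and are not taken from the paper.

```python
def l1_distance(a, b):
    """Sum of absolute coordinate differences between two descriptors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def most_similar_points(ref_descriptors, query_descriptors):
    """For each query descriptor, return (index, distance) of the
    reference descriptor with the smallest L1 distance."""
    matches = []
    for query in query_descriptors:
        dists = [l1_distance(query, ref) for ref in ref_descriptors]
        best = min(range(len(dists)), key=dists.__getitem__)
        matches.append((best, dists[best]))
    return matches


# Hypothetical descriptors; real point descriptors are high-dimensional.
ref = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
query = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]]

matches = most_similar_points(ref, query)
matched_indices = [index for index, _ in matches]
```

In the colorization described for Figure 2, each point on the novel object would then take the color assigned to its matched reference point. The brute-force loop is quadratic in the number of points, so a vectorized or tree-based search would be the natural choice for dense point clouds.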
