Learning from Imperfect Demonstrations with Self-Supervision for Robotic Manipulation

Kun Wu; Ning Liu; Zhen Zhao; Di Qiu; Jinming Li; Zhengping Che; Zhiyuan Xu; Jian Tang

TL;DR

This work addresses offline robotic manipulation when reward signals are unavailable and data are scarce by introducing Self-Supervised Data Filtering (SSDF). SSDF first pretrains a multi-modality transformer with three self-supervised objectives to learn robust representations. It then computes quality scores for imperfect trajectories by measuring their similarity to expert demonstrations. Finally, it uses weighted behavior cloning to train on a dataset expanded with high-quality imperfect data. The approach is validated on ManiSkill2 and in real-world Franka experiments, where SSDF consistently improves success rates over strong baselines, and the choice of pretraining objective combination and similarity measure further improves performance. By reusing imperfect demonstrations that would otherwise be discarded, the method improves data utilization and provides a practical offline solution for high-dimensional robotic manipulation tasks.
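
The scoring and weighting steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the use of cosine similarity, the max over expert embeddings, the clipping to [0, 1], and the squared-error action loss are all assumptions made for illustration, and the names `quality_scores` and `weighted_bc_loss` are hypothetical. The embeddings stand in for the outputs of the pretrained transformer.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def quality_scores(expert_emb: List[Vector], imperfect_emb: List[Vector]) -> List[float]:
    """Score each imperfect embedding by its best match among expert embeddings.

    Assumption: max over experts, clipped to [0, 1]. The paper's metric may differ.
    """
    scores = []
    for z in imperfect_emb:
        best = max(cosine_similarity(z, e) for e in expert_emb)
        scores.append(min(1.0, max(0.0, best)))
    return scores


def weighted_bc_loss(pred_actions: List[Vector],
                     target_actions: List[Vector],
                     weights: List[float]) -> float:
    """Weighted mean of per-sample squared action errors (assumed loss form)."""
    total = 0.0
    for pred, target, w in zip(pred_actions, target_actions, weights):
        sq_err = sum((p - t) ** 2 for p, t in zip(pred, target))
        total += w * sq_err
    weight_sum = sum(weights)
    return total / weight_sum if weight_sum > 0 else 0.0


# Toy 2-D embeddings; real embeddings would come from the pretrained transformer.
expert = [[1.0, 0.0], [0.0, 1.0]]
imperfect = [[2.0, 0.0], [-1.0, 0.0], [1.0, 1.0]]
scores = quality_scores(expert, imperfect)
loss = weighted_bc_loss([[0.0], [1.0]], [[1.0], [1.0]], [1.0, 0.5])
```

In this sketch, expert samples would typically receive weight 1.0 and imperfect samples their quality score, so low-scoring data contributes little to the loss.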

Abstract

Improving data utilization, especially for imperfect data from task failures, is crucial for robotic manipulation because real-world data collection is challenging, time-consuming, and expensive. Current imitation learning (IL) typically discards imperfect data and focuses solely on successful expert data. While reinforcement learning (RL) can learn from exploration and failures, the sim2real gap and its reliance on dense rewards and online exploration make it difficult to apply effectively in real-world scenarios. In this work, we aim to leverage imperfect data, without the need for reward information, to improve model performance for robotic manipulation in an offline manner. Specifically, we introduce a Self-Supervised Data Filtering framework (SSDF) that combines expert and imperfect data to compute quality scores for failed trajectory segments. High-quality segments from the failed data are used to expand the training dataset. The enhanced dataset can then be used with any downstream policy learning method for robotic manipulation tasks. Extensive experiments on the ManiSkill2 benchmark, built on the high-fidelity SAPIEN simulator, and on real-world robotic manipulation tasks using the Franka robot arm demonstrate that SSDF can accurately expand the training dataset with high-quality imperfect data and improve the success rates for all robotic manipulation tasks.
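
The dataset expansion step can be sketched as a simple filter over scored segments. This is only an illustration under stated assumptions: the abstract does not say how high-quality segments are selected, so the fixed cutoff (`threshold=0.8`), the function name `expand_dataset`, and the choice to carry the score along as a sample weight are all hypothetical.

```python
from typing import Any, List, Tuple


def expand_dataset(expert_segments: List[Any],
                   failed_segments: List[Any],
                   scores: List[float],
                   threshold: float = 0.8) -> List[Tuple[Any, float]]:
    """Return (segment, weight) pairs for downstream policy learning.

    Expert segments are always kept with weight 1.0. A failed segment is kept
    only if its quality score reaches the (hypothetical) threshold, and its
    score is reused as its weight.
    """
    if len(failed_segments) != len(scores):
        raise ValueError("each failed segment needs exactly one quality score")
    dataset = [(seg, 1.0) for seg in expert_segments]
    for seg, score in zip(failed_segments, scores):
        if score >= threshold:
            dataset.append((seg, score))
    return dataset


# Placeholder segment identifiers; real segments would hold observations and actions.
expanded = expand_dataset(["expert_0"], ["failed_0", "failed_1"], [0.9, 0.3])
```

Because the output is just a list of weighted segments, it can be passed to any downstream policy learner, which matches the abstract's claim that the enhanced dataset is method-agnostic.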

Paper Structure (13 sections, 9 equations, 4 figures, 5 tables)

Introduction
Related Work
Methodology
    Preliminary on Imitation Learning
    Self-Supervised Data Filtering Framework
    Transformer Pretraining via Self-Supervised Learning
    Calculation of Quality Score by Similarity Metric
    Policy Learning with Weighted Behavior Cloning
Experiments
    Experiment Setup
    Evaluation Results
    Ablation Study in Simulation
Conclusion

Figures (4)

Figure 1: Overview of SSDF. SSDF contains three steps: 1) Transformer Pretraining via Self-Supervised Learning, 2) Calculation of Quality Score by Similarity Metric, and 3) Policy Learning with Weighted Behavior Cloning.
Figure 2: The transformer pretraining process includes three tasks: 1) Masked Transition Prediction (MTP), 2) Transition Reconstruction (TR), and 3) Action Autoregression (AA). The example shows the pretraining process with three time-step inputs. Colored lines link the inputs and outputs.
Figure 3: Experiment setup. We conducted experiments on five tasks from the ManiSkill2 benchmark and five real-world tasks using a single Franka robotic arm. Fill and Hang are soft-body tasks. The green circles in PickCube and PickYCB represent the goal position.
Figure 4: Comparison of imperfect data of different quantity and quality. The x-axis shows the success rate of the behavior policy that collected the imperfect data. The y-axis shows the success rate of SSDF. The green and red lines correspond to expert-to-imperfect data ratios of 1:1 and 1:3, respectively.
