Learning to Discern: Imitating Heterogeneous Human Demonstrations with Preference and Representation Learning

Sachit Kuhar, Shuo Cheng, Shivang Chopra, Matthew Bronars, Danfei Xu

TL;DR

Offline imitation learning suffers from heterogeneous, suboptimal demonstrations. L2D learns a temporally aware latent representation of trajectory segments, trains a quality critic via offline preference learning, and uses a Gaussian Mixture Model to capture multimodal demonstration quality, enabling effective filtering of high-quality data from unknown-quality datasets. Across simulated and real-robot tasks, L2D identifies high-quality demonstrations from both familiar and unseen demonstrators and yields policy performance close to oracle, outperforming competing baselines. This approach enables scalable, data-centric IL in robotics by robustly assessing demonstration quality without environment interaction or dense rewards.

Abstract

Practical Imitation Learning (IL) systems rely on large human demonstration datasets for successful policy learning. However, challenges lie in maintaining the quality of collected data and addressing the suboptimal nature of some demonstrations, which can compromise the overall dataset quality and hence the learning outcome. Furthermore, the intrinsic heterogeneity in human behavior can produce equally successful but disparate demonstrations, further exacerbating the challenge of discerning demonstration quality. To address these challenges, this paper introduces Learning to Discern (L2D), an offline imitation learning framework for learning from demonstrations with diverse quality and style. Given a small batch of demonstrations with sparse quality labels, we learn a latent representation for temporally embedded trajectory segments. Preference learning in this latent space trains a quality evaluator that generalizes to new demonstrators exhibiting different styles. Empirically, we show that L2D can effectively assess and learn from varying demonstrations, thereby leading to improved policy performance across a range of tasks both in simulation and on a physical robot.
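
To make the preference-learning step concrete, here is a minimal sketch of a pairwise preference loss over two trajectory segments. The abstract does not state the exact loss, so the Bradley-Terry form used here is an assumption (it is a common choice in preference learning), and the function names and example scores are hypothetical. The per-step scores stand in for the outputs of a quality critic applied to latent segment representations.

```python
import math


def preference_probability(scores_a, scores_b):
    """P(segment A is preferred over segment B) under a Bradley-Terry model.

    Each argument is a list of per-step critic scores for one segment.
    (Assumed form; not taken from the paper.)
    """
    diff = sum(scores_a) - sum(scores_b)
    # Numerically stable logistic function of the summed-score difference.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def preference_loss(scores_a, scores_b, label):
    """Cross-entropy against a preference label.

    label = 1.0 if A is preferred, 0.0 if B is preferred, 0.5 for a tie.
    """
    p = preference_probability(scores_a, scores_b)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))


# Hypothetical critic outputs for a higher- and a lower-quality segment.
good_segment = [0.9, 0.8, 0.7]
bad_segment = [0.2, 0.1, 0.3]
loss_correct = preference_loss(good_segment, bad_segment, label=1.0)
loss_wrong = preference_loss(good_segment, bad_segment, label=0.0)
```

In this sketch the loss is small when the critic already scores the preferred segment higher (`loss_correct`) and large when the label contradicts the scores (`loss_wrong`), which is the signal a critic would be trained on.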

Paper Structure (16 sections, 6 figures, 7 tables)

Figures (6)

  • Figure 1: L2D: Our framework proceeds in three primary stages during training. First, we augment trajectory segments with temporal embeddings and employ contrastive learning to map these segments to a latent space. Next, we use preference learning in this latent space to train a quality critic on sparse preference labels. Finally, we train a Gaussian Mixture Model (GMM) on the critic's outputs where the different modes represent demonstrator quality.
  • Figure 2: Filtering Unseen Demonstrations: When faced with unseen demonstrations, L2D partitions the trajectory into segments and augments each with its chronological ordering in the sequence. The segments are mapped to the latent space learned during training and ranked by the quality critic. After calculating the mean and variance of ranks in a full trajectory, the trained GMM is employed to predict a preference label for the unseen demonstration.
  • Figure 3: Good (green) and Bad (red) demonstrations for the Robomimic Square task.
  • Figure 4: Histogram Distribution Visualization: Comparing demonstration quality scores (good, okay, bad) for the Square task from unseen demonstrators. The left histogram represents a conventional preference learning approach, while the right highlights the efficacy of our method, L2D.
  • Figure 5: Demo Quality Estimation. We show the predicted quality for real-world demonstrations of the Stack task. Trajectory segments with less optimal behaviors (e.g., jittering or waving) are assigned lower scores (marked with red bounding boxes).
  • ...and 1 more figures
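
The filtering procedure in the Figure 2 caption can be sketched roughly as follows: split a trajectory into segments tagged with their position in the sequence, score each segment, summarize the scores by mean and variance, and let a Gaussian mixture assign a quality label. This is only an illustrative sketch under several assumptions: the critic is a toy jitter penalty rather than a learned model, raw scores are used in place of ranks, the mixture has diagonal covariance, and its parameters are hand-picked hypothetical values rather than fitted ones. All names are made up for this example.

```python
import math
from statistics import fmean, pvariance


def split_with_time(trajectory, segment_len):
    """Split into non-overlapping segments, each tagged with a position in [0, 1].

    Trailing steps that do not fill a whole segment are dropped.
    """
    starts = range(0, len(trajectory) - segment_len + 1, segment_len)
    n = len(starts)
    return [(trajectory[s:s + segment_len], i / max(n - 1, 1))
            for i, s in enumerate(starts)]


def gmm_predict(x, weights, means, variances):
    """Return (most likely component index, posteriors) for a diagonal GMM."""
    log_probs = []
    for w, mu, var in zip(weights, means, variances):
        lp = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            lp += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
        log_probs.append(lp)
    top = max(log_probs)  # subtract the max before exponentiating for stability
    unnorm = [math.exp(lp - top) for lp in log_probs]
    total = sum(unnorm)
    post = [u / total for u in unnorm]
    return post.index(max(post)), post


def classify_trajectory(trajectory, critic, segment_len, gmm):
    """Score segments, aggregate to (mean, variance), and query the mixture."""
    scores = [critic(seg, t) for seg, t in split_with_time(trajectory, segment_len)]
    features = (fmean(scores), pvariance(scores))
    return gmm_predict(features, *gmm)


def toy_critic(segment, t):
    """Hypothetical stand-in for a learned critic: penalizes jitter, ignores t."""
    return -fmean(abs(b - a) for a, b in zip(segment, segment[1:]))


# Hypothetical mixture: component 0 ~ low quality, component 1 ~ high quality.
toy_gmm = (
    [0.5, 0.5],                    # mixture weights
    [(-2.0, 0.0), (-0.1, 0.0)],    # per-component (mean score, score variance)
    [(0.25, 0.01), (0.01, 0.01)],  # per-component diagonal variances
)

smooth = [0.1 * i for i in range(12)]       # steady 1-D motion
jittery = [(-1.0) ** i for i in range(12)]  # oscillating 1-D motion
smooth_label, _ = classify_trajectory(smooth, toy_critic, 4, toy_gmm)
jittery_label, _ = classify_trajectory(jittery, toy_critic, 4, toy_gmm)
```

In practice the mixture would be fitted to critic outputs on labeled demonstrations instead of being written by hand, and the component-to-quality mapping would be read off from those labels.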