AED: Adaptable Error Detection for Few-shot Imitation Policy

Jia-Fong Yeh, Kuo-Han Hung, Pang-Chi Lo, Chi-Ming Chung, Tsung-Han Wu, Hung-Ting Su, Yi-Ting Chen, Winston H. Hsu

TL;DR

This work develops a cross-domain AED benchmark consisting of 322 base and 153 novel environments, and proposes PrObe, which is equipped with a powerful pattern extractor and guided by novel learning objectives to parse discernible patterns in the policy feature representations of normal or error states.

Abstract

We introduce a new task called Adaptable Error Detection (AED), which aims to identify behavior errors in few-shot imitation (FSI) policies based on visual observations in novel environments. The potential to cause serious damage to surrounding areas limits the application of FSI policies in real-world scenarios. Thus, a robust system is necessary to notify operators when FSI policies are inconsistent with the intent of demonstrations. This task introduces three challenges: (1) detecting behavior errors in novel environments, (2) identifying behavior errors that occur without revealing notable changes, and (3) detecting errors without complete temporal information of the rollout, since detection must happen online. However, existing benchmarks cannot support the development of AED because their tasks do not present all these challenges. To this end, we develop a cross-domain AED benchmark, consisting of 322 base and 153 novel environments. Additionally, we propose Pattern Observer (PrObe) to address these challenges. PrObe is equipped with a powerful pattern extractor and guided by novel learning objectives to parse discernible patterns in the policy feature representations of normal or error states. Through our comprehensive evaluation, PrObe demonstrates superior capability to detect errors arising from a wide range of FSI policies, consistently surpassing strong baselines. Moreover, we conduct detailed ablations and a pilot study on error correction to validate the effectiveness of the proposed architecture design and the practicality of the AED task, respectively. The AED project page can be found at https://aed-neurips.github.io/.

Paper Structure (63 sections, 6 equations, 15 figures, 8 tables)


Figures (15)

  • Figure 1: Our novel adaptable error detection (AED) task. To monitor the behavior of the few-shot imitation (FSI) policy $\pi_{\theta}$, the adaptable error detector needs to address three challenges: (1) it must work in novel environments, (2) behavior errors may occur without revealing notable changes, and (3) it must detect errors online. These challenges make existing error detection methods infeasible.
  • Figure 2: Our AED protocol. The successful agent rollouts $X^{b}_{succ}$, failed agent rollouts $X^{b}_{fail}$, and a few expert demonstrations $\mathcal{D}^{b}$ are available for all base environments $E^{b}$. The task then contains three phases: policy training, AED training, and AED inference. We aim to train an adaptable error detector $\phi$ to report policy $\pi_{\theta}$'s behavior errors when it performs in novel environments $E^{n}$.
  • Figure 3: Architecture of PrObe. PrObe detects behavior errors through the patterns extracted from policy features. The learnable gated pattern extractor and flow generator (LSTM) compute the pattern flow of history features $f_{h}$. Then, fusion with the transformed task embeddings $f_{\zeta}$ aims to compare task consistency. PrObe predicts the behavior error based on the fused embeddings. The objectives $L_{pat}$, $L_{tem}$, and $L_{cls}$ optimize the corresponding outputs.
  • Figure 4: Performance comparison of AED methods on seven challenging FSI tasks. The values under each policy indicate its success rate for each task. AUROC[$\uparrow$] and AUPRC[$\uparrow$] scores are listed in the upper and lower rows for each policy, respectively, ranging from 0 to 1. According to the statistics table, PrObe achieves the highest Top 1 counts (15 and 17 out of 21 cases), average ranking, and average performance difference in both metrics, demonstrating its superiority and robustness.
  • Figure 5: Visualization of timing accuracy. Raw probabilities and SVDDED outputs of selected successful (left) and failed (right) rollouts are drawn. PrObe raises the error at the correct time in the failed rollout and stably recognizes normal states in the successful rollout.
  • ...and 10 more figures
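
The AUROC and AUPRC scores named in the Figure 4 caption can be illustrated with a small, self-contained sketch. This is not the authors' evaluation code. It assumes each rollout is summarized by a single error score and a binary label (1 for a rollout containing a behavior error, 0 for a successful rollout), and it estimates AUPRC by average precision. The function names `auroc` and `auprc` and the example numbers are hypothetical.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise AUROC: chance that a failed rollout outscores a successful one.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auprc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Average precision, a common estimator of the area under the PR curve.

    Assumes scores are distinct; tied scores keep their input order.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank rollouts from most to least suspicious.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / total_pos


if __name__ == "__main__":
    # Hypothetical detector scores for six rollouts (1 = failed, 0 = successful).
    example_scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
    example_labels = [1, 0, 1, 1, 0, 0]
    print("AUROC:", auroc(example_scores, example_labels))
    print("AUPRC:", auprc(example_scores, example_labels))
```

Both metrics range from 0 to 1 and are threshold-free, so they compare detectors without committing to a particular alarm threshold. AUPRC is the more sensitive of the two when failed rollouts are rare.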