Sample-Efficient Expert Query Control in Active Imitation Learning via Conformal Prediction

Arad Firouzkouhi, Omid Mirzaeedodangeh, Lars Lindemann

TL;DR

CRSAIL tackles covariate shift in active imitation learning by querying the expert post hoc, and only for states that are under-represented in the expert-labeled dataset, as measured by the distance to the K-th nearest expert state. It grounds threshold selection in conformal prediction: a single radius R is calibrated from an initial on-policy rollout, and a user-specified miscoverage level α controls the expected query rate. On MuJoCo tasks, the method matches or exceeds expert-level reward while reducing total expert queries by up to 96% vs. DAgger and up to 65% vs. prior AIL methods, and it is empirically robust to α and K, which eases deployment on new systems with unknown dynamics. Overall, CRSAIL offers a principled, data-driven approach to efficient expert querying that preserves learning effectiveness without requiring real-time expert interventions.

Abstract

Active imitation learning (AIL) combats covariate shift by querying an expert during training. However, expert action labeling often dominates the cost, especially in GPU-intensive simulators, human-in-the-loop settings, and robot fleets that revisit near-duplicate states. We present Conformalized Rejection Sampling for Active Imitation Learning (CRSAIL), a querying rule that requests an expert action only when the visited state is under-represented in the expert-labeled dataset. CRSAIL scores state novelty by the distance to the $K$-th nearest expert state and sets a single global threshold via conformal prediction. This threshold is the empirical $(1-\alpha)$ quantile of on-policy calibration scores, providing a distribution-free calibration rule that links $\alpha$ to the expected query rate and makes $\alpha$ a task-agnostic tuning knob. This state-space querying strategy is robust to outliers and, unlike safety-gate-based AIL, can be run without real-time expert takeovers: we roll out full trajectories (episodes) with the learner and only afterward query the expert on a subset of visited states. Evaluated on MuJoCo robotics tasks, CRSAIL matches or exceeds expert-level reward while reducing total expert queries by up to 96% vs. DAgger and up to 65% vs. prior AIL methods, with empirical robustness to $\alpha$ and $K$, easing deployment on novel systems with unknown dynamics.
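The scoring and calibration steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the Euclidean metric, the brute-force neighbour search, the function names `kth_nn_distance` and `calibrate_radius`, and the synthetic Gaussian data are all assumptions made for illustration. The threshold uses a plain empirical quantile, which may differ from the exact quantile convention used in the paper.

```python
import numpy as np

def kth_nn_distance(states, expert_states, k):
    """Distance from each state to its k-th nearest expert state.

    Assumes a Euclidean metric; the source only says "distance".
    """
    # Pairwise distances, shape (n_states, n_expert).
    diffs = states[:, None, :] - expert_states[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # np.partition puts the k-th smallest value of each row at index k - 1.
    return np.partition(dists, k - 1, axis=1)[:, k - 1]

def calibrate_radius(calibration_scores, alpha):
    """Empirical (1 - alpha) quantile of the calibration scores."""
    return float(np.quantile(calibration_scores, 1.0 - alpha))

# Synthetic stand-ins (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
expert_states = rng.normal(size=(500, 4))             # states in the expert dataset
calib_states = rng.normal(scale=1.5, size=(200, 4))   # states from an on-policy rollout

scores = kth_nn_distance(calib_states, expert_states, k=5)
R = calibrate_radius(scores, alpha=0.1)

# Fraction of calibration states that would trigger a query at this radius.
query_rate = float(np.mean(scores > R))
```

On the calibration set itself, the fraction of states scoring above `R` is close to `alpha` by construction, which is the sense in which the miscoverage level acts as a knob on the query rate.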

Paper Structure

This paper contains 22 sections, 21 equations, 3 figures, 5 tables, 2 algorithms.

Figures (3)

  • Figure 1: Overview of CRSAIL. (A) We assign each learner state a geometric novelty score $s_K(x_t)$ based on the distance to its $K$-th nearest neighbour in the expert dataset $D_{\mathrm{exp}}$. (B) Offline, we roll out an initial behavior cloning policy, compute scores for visited states, and set a single query threshold $R$ as the $(1-\alpha)$-quantile of this distribution. (C) Online, during training, we roll out our learner policy for one trajectory and query the expert where $s_K(x_t) > R$. We add the labeled states to $D_{\mathrm{exp}}$ and retrain the policy.
  • Figure 2: Querying metrics throughout training for InvDP (the inverted double pendulum task), aggregated across all initial datasets of the same size ($M=1000$ shown as a representative example). (a) shows the reward obtained as a function of the number of queries, and (b) shows the number of queries made during training. Shaded areas show the standard deviation.
  • Figure 3: Number of queries made in each episode during training, based on the length of the episode for the inverted double pendulum task.
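The online step in panel (C) of Figure 1 can be sketched as a post hoc filter over one finished trajectory. This is a hedged illustration rather than the paper's algorithm: the names `post_hoc_query` and `toy_expert`, the Euclidean metric, and the toy data are hypothetical. It also assumes that every state is scored against the dataset as it stood before the episode, and it omits policy retraining.

```python
import numpy as np

def kth_nn_distance(states, expert_states, k):
    """Euclidean distance from each state to its k-th nearest expert state."""
    dists = np.linalg.norm(states[:, None, :] - expert_states[None, :, :], axis=-1)
    return np.partition(dists, k - 1, axis=1)[:, k - 1]

def post_hoc_query(trajectory, expert_states, expert_actions, expert_fn, radius, k):
    """Label only the visited states whose novelty score exceeds the radius.

    Returns the enlarged dataset and the number of expert queries made.
    """
    scores = kth_nn_distance(trajectory, expert_states, k)
    novel = trajectory[scores > radius]
    if len(novel) == 0:
        return expert_states, expert_actions, 0
    # The expert is consulted only after the episode, one call per novel state.
    new_actions = np.stack([expert_fn(x) for x in novel])
    return (np.concatenate([expert_states, novel]),
            np.concatenate([expert_actions, new_actions]),
            len(novel))

def toy_expert(x):
    """Hypothetical stand-in expert: a 1-D action that pushes x[0] toward zero."""
    return np.array([-x[0]])

expert_states = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
expert_actions = np.zeros((3, 1))
trajectory = np.array([[0.1, 0.0], [5.0, 5.0], [1.0, 0.05], [-4.0, 0.0]])

states, actions, n_queries = post_hoc_query(
    trajectory, expert_states, expert_actions, toy_expert, radius=2.0, k=2
)
```

In this toy run only the two far-away states, `[5, 5]` and `[-4, 0]`, exceed the radius, so the expert is called twice for a four-step episode. A real training loop would then refit the policy on the enlarged dataset before the next rollout.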