
Aligning Human Intent from Imperfect Demonstrations with Confidence-based Inverse soft-Q Learning

Xizhou Bu, Wenjuan Li, Zhengxiong Liu, Zhiqiang Ma, Panfeng Huang

TL;DR

This work tackles learning from imperfect human demonstrations by proposing Confidence-based Inverse soft-Q Learning (CIQL), which assigns fine-grained confidence scores to individual transitions using a transition-based noise angle. It introduces two learning variants, CIQL-E (expert-focused) and CIQL-A (agent-focused), and shows that penalizing noise aligns better with human intent than simply filtering it. By recovering rewards with the inverse soft-Bellman operator and using occupancy measures, CIQL improves over the IQ-Learn baseline, raising its success rate by 40.3% on average on linear tasks and showing strong reward-policy alignment (e.g., a -0.92 correlation for CIQL-A versus 0.46 for CIQL-E). The approach generalizes to multi-stage tasks such as block stacking and transfers from simulation to a real robot (Sim2Real), which suggests practical value for robot learning from imperfect demonstrations and underlines the importance of noise handling and confidence estimation. A noise angle between $20^{\circ}$ and $60^{\circ}$ is identified as the optimal range for linear tasks.
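
For reference, the reward recovery mentioned above can be written with the inverse soft Bellman operator in the form used by the IQ-Learn baseline. The symbols $\gamma$ (discount factor), $\mathcal{P}$ (transition dynamics), and $V^{\pi}$ (soft value function) are assumed notation that does not appear in this summary, so this should be read as a sketch of the standard definition rather than the paper's exact equation:

$$
r(s,a) = (\mathcal{T}^{\pi} Q)(s,a) = Q(s,a) - \gamma \, \mathbb{E}_{s' \sim \mathcal{P}(\cdot \mid s,a)}\left[ V^{\pi}(s') \right],
\qquad
V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[ Q(s,a) - \log \pi(a \mid s) \right].
$$

Under this definition, a learned Q-function together with its policy determines a reward for every state-action pair, which is what allows the recovered reward to be checked against human intent.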

Abstract

Imitation learning has attracted much attention because it allows robots to quickly learn human manipulation skills from demonstrations. However, in the real world, human demonstrations often exhibit random behavior that the demonstrator did not intend. Collecting high-quality human datasets is both challenging and expensive. Consequently, robots need to be able to learn behavioral policies that align with human intent from imperfect demonstrations. Previous work uses confidence scores to extract useful information from imperfect demonstrations, but relies on access to ground-truth rewards or active human supervision. In this paper, we propose a transition-based method that obtains fine-grained confidence scores for data without either requirement, and which can increase the success rate of the baseline algorithm by 40.3$\%$ on average. We develop a generalized confidence-based imitation learning framework for guiding policy learning, called Confidence-based Inverse soft-Q Learning (CIQL), as shown in Fig. 1. Based on this, we analyze two ways of processing noise and find that penalization is more aligned with human intent than filtering.

Paper Structure

This paper contains 14 sections, 15 equations, 8 figures, 2 tables, 1 algorithm.

Figures (8)

  • Figure 1: Overview of Confidence-based Inverse soft-Q Learning: 1) Evaluate each transition $(s,a,s')$ using the confidence function $w$ and estimate the optimal prior probability $\alpha$. 2) Find the optimal Q-function with CIQL and search for the optimal policy with Soft Actor-Critic (SAC). 3) Recover the reward function $r(s,a)$ using the inverse soft Bellman operator $\mathcal{T}^{\pi}$ and evaluate whether it aligns with human intent.
  • Figure 2: Noise Angle Setting. $\tau_1,\tau_2,\tau_3$ are the complete trajectories. $\tilde{s}_0$ and $o$ are the initial positions of the gripper and the target, respectively. $\theta$ is the approach angle between the vectors $\overrightarrow{\tilde{s}_t o}$ and $\overrightarrow{\tilde{s}_t \tilde{s}_{t+1}}$, and $\theta_{n}$ is the noise angle. We evaluate each transition $(s_t,a_t,s_{t+1})$ using Eq. (\ref{condition}) as the criterion.
  • Figure 3: Demonstration collection and processing. (a) The system control frequency is 20 Hz and the task time limit is 25 seconds, so each trajectory is limited to 500 steps. The trajectory-length intervals for the different datasets are: Better datasets, 100-150 steps; Worse datasets, 200-400 steps; Failed datasets, 500 steps. (b) Noise-filtering visualization for two human datasets, Better and Worse.
  • Figure 4: Effect of Noise Angle. IQ-Learn: the baseline algorithm. IQ-Learn (filter): filters noise without using confidence; it reduces to IQ-Learn when $\theta_n$ is set to 180°. CIQL-E: filters noise and uses confidence. CIQL-A: penalizes noise and uses confidence.
  • Figure 5: Algorithm Performance. We evaluate the policy model by its average success rate over 10 random seeds, each tested on 100 trajectories (1000 trajectories in total). We use all combinations of the datasets in Table \ref{table1}; the yellow triangle marks the average success rate.
  • ...and 3 more figures
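
The approach angle described in the Figure 2 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `approach_angle_deg` and `is_noise` are hypothetical, positions are plain coordinate tuples, and the rule "a transition is noise when $\theta > \theta_n$" is an assumed reading of the criterion, since the paper's exact condition is not reproduced in this summary. The default of 40° is an arbitrary value inside the $20^{\circ}$ to $60^{\circ}$ range mentioned in the TL;DR.

```python
import math


def approach_angle_deg(s_t, s_next, target):
    """Angle (degrees) between the vector s_t -> target and the step s_t -> s_next."""
    to_target = [o - s for o, s in zip(target, s_t)]
    step = [n - s for n, s in zip(s_next, s_t)]
    norm_product = (
        math.sqrt(sum(u * u for u in to_target))
        * math.sqrt(sum(v * v for v in step))
    )
    if norm_product == 0.0:
        # Assumption: a zero-length step (or a gripper already at the target)
        # has no defined direction, so it is treated as perfectly aligned.
        return 0.0
    cos_theta = sum(u * v for u, v in zip(to_target, step)) / norm_product
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


def is_noise(s_t, s_next, target, theta_n_deg=40.0):
    """Assumed criterion: flag the transition when its approach angle exceeds theta_n."""
    return approach_angle_deg(s_t, s_next, target) > theta_n_deg


# A step straight toward the target versus a step perpendicular to it.
toward = approach_angle_deg((0, 0, 0), (1, 0, 0), (2, 0, 0))
sideways = approach_angle_deg((0, 0, 0), (0, 1, 0), (2, 0, 0))
```

Under this assumed rule, setting `theta_n_deg=180.0` flags nothing, because no angle can exceed 180°. That matches the Figure 4 remark that filtering reduces to plain IQ-Learn at $\theta_n = 180°$.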