Aligning Human Intent from Imperfect Demonstrations with Confidence-based Inverse soft-Q Learning
Xizhou Bu, Wenjuan Li, Zhengxiong Liu, Zhiqiang Ma, Panfeng Huang
TL;DR
This work tackles learning from imperfect human demonstrations by proposing Confidence-based Inverse soft-Q Learning (CIQL), which assigns fine-grained confidence scores to individual transitions via a transition-based noise angle. It introduces two learning variants, CIQL-E (expert-focused) and CIQL-A (agent-focused), and finds that penalizing noise aligns better with human intent than simply filtering it out. By recovering rewards with the inverse soft-Bellman operator and using occupancy measures, CIQL improves on the IQ-Learn baseline, raising its success rate by 40.3% on average on linear tasks and showing strong reward-policy alignment (e.g., a -0.92 correlation for CIQL-A versus 0.46 for CIQL-E). A noise angle between $20^{\circ}$ and $60^{\circ}$ is identified as the optimal range for linear tasks. The approach generalizes to multi-stage tasks such as block stacking and transfers from simulation to the real world (Sim2Real), which supports its practical value for robot learning from imperfect demonstrations.
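To make the reward-recovery step concrete, the following is a minimal sketch of the inverse soft-Bellman operator as it is commonly written for discrete actions: $r(s,a) = Q(s,a) - \gamma V(s')$ with $V(s) = \alpha \log \sum_a \exp(Q(s,a)/\alpha)$. It is not the authors' implementation. The function names, the single-sample estimate of the next-state expectation, and the default `gamma` and `alpha` values are all assumptions made for illustration.

```python
import numpy as np


def soft_value(q_values, alpha=1.0):
    """Soft state value V(s) = alpha * log sum_a exp(Q(s, a) / alpha).

    q_values has shape (batch, num_actions). The max is subtracted
    before exponentiating for numerical stability.
    """
    scaled = np.asarray(q_values, dtype=float) / alpha
    m = scaled.max(axis=-1, keepdims=True)
    return alpha * (m.squeeze(-1) + np.log(np.exp(scaled - m).sum(axis=-1)))


def recover_reward(q_sa, q_next, done, gamma=0.99, alpha=1.0):
    """Hypothetical helper: r(s, a) = Q(s, a) - gamma * V(s').

    q_sa:   Q-values of the taken actions, shape (batch,)
    q_next: Q-values of all actions in the next state, shape (batch, num_actions)
    done:   1.0 where the transition is terminal, else 0.0, shape (batch,)

    Assumption: the expectation over next states is approximated by the
    single observed next state of each transition.
    """
    q_sa = np.asarray(q_sa, dtype=float)
    done = np.asarray(done, dtype=float)
    v_next = soft_value(q_next, alpha)
    return q_sa - gamma * (1.0 - done) * v_next


# Usage: two transitions, four actions, the second one terminal.
rewards = recover_reward(
    q_sa=[1.0, 2.0],
    q_next=np.zeros((2, 4)),
    done=[0.0, 1.0],
    gamma=0.5,
)
```

With all next-state Q-values equal to zero, the soft value reduces to $\log 4$, so the first reward is $1 - 0.5 \log 4$ and the terminal transition simply returns its Q-value.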
Abstract
Imitation learning has attracted much attention because it lets robots quickly learn human manipulation skills from demonstrations. In the real world, however, human demonstrations often contain random behavior that the demonstrator did not intend, and collecting high-quality human datasets is both challenging and expensive. Robots therefore need to learn behavioral policies that align with human intent from imperfect demonstrations. Previous work uses confidence scores to extract useful information from imperfect demonstrations, but it relies on access to ground-truth rewards or active human supervision. In this paper, we propose a transition-based method that obtains fine-grained confidence scores for the data without either requirement and increases the success rate of the baseline algorithm by 40.3$\%$ on average. We develop a generalized confidence-based imitation learning framework for guiding policy learning, called Confidence-based Inverse soft-Q Learning (CIQL), as shown in Fig. 1. Based on this framework, we analyze two ways of processing noise and find that penalization is more aligned with human intent than filtering.
