MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention

Yuxin Chen; Chen Tang; Jianglan Wei; Chenran Li; Ran Tian; Xiang Zhang; Wei Zhan; Peter Stone; Masayoshi Tomizuka

TL;DR

This work addresses the problem of aligning a pre-trained prior policy with human preferences through interactive interventions. It introduces MEReQ, which infers a residual reward $r_{\mathrm{R}}$ capturing the discrepancy between the human expert's reward and the prior policy's reward, and then updates the policy via Residual Q-Learning to align it with the unknown expert reward. By combining Maximum-Entropy IRL with residual reward learning and leveraging pseudo-expert trajectories, MEReQ achieves higher sample efficiency and requires less human effort than baselines across simulated and real-world tasks. The approach enables practical human-in-the-loop alignment of embodied agents with fewer interventions.

Abstract

Aligning robot behavior with human preferences is crucial for deploying embodied AI agents in human-centered environments. A promising solution is interactive imitation learning from human intervention, where a human expert observes the policy's execution and provides interventions as feedback. However, existing methods often fail to utilize the prior policy efficiently to facilitate learning, thus hindering sample efficiency. In this work, we introduce MEReQ (Maximum-Entropy Residual-Q Inverse Reinforcement Learning), designed for sample-efficient alignment from human intervention. Instead of inferring the complete human behavior characteristics, MEReQ infers a residual reward function that captures the discrepancy between the human expert's and the prior policy's underlying reward functions. It then employs Residual Q-Learning (RQL) to align the policy with human preferences using this residual reward function. Extensive evaluations on simulated and real-world tasks demonstrate that MEReQ achieves sample-efficient policy alignment from human intervention.
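
As a rough illustration of the residual reward inference described above, the sketch below performs one maximum-entropy IRL gradient step under the assumption that the residual reward is linear in a feature vector, $r_{\mathrm{R}}(s,a) = \theta^\top f(s,a)$. In that setting the log-likelihood gradient is the gap between the expert's and the current policy's average features. The function name `residual_reward_step`, the learning rate, and the toy feature arrays are hypothetical and not taken from the paper; the Residual Q-Learning policy update that would follow each step is omitted.

```python
import numpy as np


def residual_reward_step(theta, expert_features, policy_features, lr=0.1):
    """One feature-matching gradient-ascent step on the residual reward weights.

    Assumes a linear residual reward r_R(s, a) = theta @ f(s, a) (an
    illustrative assumption). Each row of the feature arrays is f(s, a)
    for one sampled state-action pair.
    """
    expert_mean = np.mean(expert_features, axis=0)  # average features under expert interventions
    policy_mean = np.mean(policy_features, axis=0)  # average features under the current policy
    grad = expert_mean - policy_mean                # max-ent IRL gradient for linear rewards
    return theta + lr * grad


# Toy data: the expert favors feature 0, the current policy favors feature 1.
theta = np.zeros(2)
expert_features = np.array([[1.0, 0.0], [1.0, 0.2]])
policy_features = np.array([[0.0, 1.0], [0.2, 1.0]])
theta = residual_reward_step(theta, expert_features, policy_features)
```

After the step, the weight on the feature the expert favors rises and the weight on the feature the policy over-uses falls, which is the direction in which a subsequent policy update would shift behavior.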

Paper Structure (25 sections, 17 equations, 13 figures, 9 tables, 1 algorithm)

Introduction
Related Work
Preliminaries
Policy Customization and Residual Q-Learning
Maximum-Entropy Inverse Reinforcement Learning
Problem Formulation
Max-Ent Residual-Q Inverse Reinforcement Learning (MEReQ)
A Naive Maximum-Entropy IRL Solution
Residual Reward Inference and Policy Update
Algorithm
Experiments
Experimental Results
Limitations and Future Work
Detailed Environment Settings
Learning from Synthesized Expert with Heuristic-based Intervention
...and 10 more sections

Figures (13)

Figure 1: Overview of MEReQ, designed for sample-efficient alignment from human intervention. From human intervention samples, MEReQ infers a residual reward that captures the discrepancy between the human expert’s and the prior policy’s underlying reward functions via maximum-entropy inverse reinforcement learning, and then updates the prior policy with Residual Q-Learning (RQL).
Figure 2: Sample Efficiency. (Top) MEReQ converges faster and maintains a low intervention rate throughout the sample collection iterations. The error bands indicate a 95% confidence interval across 8 trials. (Bottom) MEReQ requires fewer total expert samples than the baselines to achieve comparable policy performance under varying intervention rate thresholds $\delta$. The error bars indicate a 95% confidence interval. See Tab. \ref{tab:sample_efficiency} in Appendix \ref{app:additional_results} for detailed values.
Figure 3: Human Effort. MEReQ effectively reduces human effort. The error bands indicate a 95% confidence interval across 3 trials. See Tab. \ref{tab:human_efforts} in Appendix \ref{app:additional_results} for detailed values.
Figure 4: Left: Bottle-Pushing-Human Rollout. Before alignment, the robot knocks down the bottle by pushing at a high contact point. After alignment, the robot pushes the bottle to the goal position at a low contact point. Right: Pillow-Grasping-Human Rollout. Before alignment, the robot fails to grasp the pillow by the center. After alignment, the robot grasps the pillow successfully.
Figure 5: Highway-Sim Sample Rollout. The green box is the ego vehicle, and the blue boxes are the surrounding vehicles. The bird's-eye-view bounding box follows the ego vehicle.
...and 8 more figures
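
The intervention rate and the thresholds $\delta$ referenced in the Figure 2 caption can be made concrete with a small sketch. It assumes the intervention rate of a sample collection iteration is the fraction of steps controlled by the expert, and that a threshold is "reached" at the first iteration whose rate is at or below $\delta$; both conventions, and the helper names `intervention_rate` and `first_iteration_below`, are illustrative assumptions rather than definitions quoted from the paper.

```python
def intervention_rate(expert_flags):
    """Fraction of steps in one collection iteration controlled by the expert.

    expert_flags: sequence of booleans, True where the expert intervened
    (an assumed logging format).
    """
    if not expert_flags:
        raise ValueError("expert_flags must contain at least one step")
    return sum(expert_flags) / len(expert_flags)


def first_iteration_below(rates, delta):
    """Index of the first iteration whose rate is <= delta, or None if never reached."""
    for i, rate in enumerate(rates):
        if rate <= delta:
            return i
    return None


# Toy per-iteration rates; the numbers are made up for illustration.
rates = [0.4, 0.2, 0.08, 0.05]
reached_at = first_iteration_below(rates, delta=0.1)
```

Summing the expert-controlled steps over all iterations up to `reached_at` would give a count of expert samples needed to reach the threshold, the kind of quantity compared in the bottom panel of Figure 2.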
