Masked IRL: LLM-Guided Reward Disambiguation from Demonstrations and Language

Minyoung Hwang, Alexandra Forsey-Smerek, Nathaniel Dennler, Andreea Bobu

TL;DR

Masked IRL addresses the ill-posed nature of reward learning from demonstrations by jointly leveraging demonstrations and natural language through LLM-guided state-relevance masking and instruction disambiguation. It introduces a masking loss that enforces invariance to irrelevant state components and uses FiLM conditioning to integrate language into the reward model. When language is underspecified, demonstrations and LLM reasoning ground the instructions, enabling robust learning with fewer demonstrations. Across simulation and real-robot experiments, Masked IRL achieves up to 4.7× data savings and up to 15% performance gains, demonstrating improved sample efficiency, generalization, and robustness to language ambiguity in language-conditioned IRL.
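For readers unfamiliar with FiLM (feature-wise linear modulation), the idea is that a conditioning input, here a language embedding, produces a per-feature scale and shift that are applied to the hidden features of another network. The sketch below is a minimal, dependency-free illustration of that operation only. The function name `film`, the linear projections, and all weights are hypothetical and are not taken from the paper's reward model.

```python
def linear(weights, bias, x):
    """Apply a dense layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film(features, lang_embedding, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM: scale and shift each feature using values predicted
    from the language embedding (all parameters are illustrative)."""
    gamma = linear(w_gamma, b_gamma, lang_embedding)  # per-feature scale
    beta = linear(w_beta, b_beta, lang_embedding)     # per-feature shift
    return [g * h + b for g, h, b in zip(gamma, features, beta)]


# Hypothetical toy values: 2 state features, 2-dim language embedding.
features = [1.0, 2.0]
lang_embedding = [1.0, 0.0]
modulated = film(
    features, lang_embedding,
    w_gamma=[[2.0, 0.0], [0.0, 3.0]], b_gamma=[0.0, 0.0],
    w_beta=[[1.0, 0.0], [1.0, 0.0]], b_beta=[0.5, 0.5],
)
```

In this toy setting the embedding scales the first feature and zeroes out the second, which shows how a language input can change which features the downstream reward head sees.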

Abstract

Robots can adapt to user preferences by learning reward functions from demonstrations, but with limited data, reward models often overfit to spurious correlations and fail to generalize. This happens because demonstrations show robots how to do a task but not what matters for that task, causing the model to focus on irrelevant state details. Natural language can more directly specify what the robot should focus on, and, in principle, disambiguate between many reward functions consistent with the demonstrations. However, existing language-conditioned reward learning methods typically treat instructions as simple conditioning signals, without fully exploiting their potential to resolve ambiguity. Moreover, real instructions are often ambiguous themselves, so naive conditioning is unreliable. Our key insight is that these two input types carry complementary information: demonstrations show how to act, while language specifies what is important. We propose Masked Inverse Reinforcement Learning (Masked IRL), a framework that uses large language models (LLMs) to combine the strengths of both input types. Masked IRL infers state-relevance masks from language instructions and enforces invariance to irrelevant state components. When instructions are ambiguous, it uses LLM reasoning to clarify them in the context of the demonstrations. In simulation and on a real robot, Masked IRL outperforms prior language-conditioned IRL methods by up to 15% while using up to 4.7 times less data, demonstrating improved sample efficiency, generalization, and robustness to ambiguous language. Project page: https://MIT-CLEAR-Lab.github.io/Masked-IRL. Code: https://github.com/MIT-CLEAR-Lab/Masked-IRL
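To make "enforces invariance to irrelevant state components" concrete, here is a minimal sketch of one plausible invariance penalty. It assumes a binary mask (1 = relevant, 0 = irrelevant), perturbs only the irrelevant dimensions with Gaussian noise, and penalizes the squared change in predicted reward. The function name `masking_loss`, the noise model, and the squared-error form are assumptions for illustration; the paper's actual loss may differ.

```python
import random


def masking_loss(reward_fn, states, mask, noise_scale=1.0, rng=None):
    """Mean squared change in reward when irrelevant (mask == 0)
    state dimensions are perturbed. Illustrative form only."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    total = 0.0
    for state in states:
        perturbed = [
            x if m == 1 else x + rng.gauss(0.0, noise_scale)
            for x, m in zip(state, mask)
        ]
        total += (reward_fn(state) - reward_fn(perturbed)) ** 2
    return total / len(states)


# Hypothetical 3-dim states; only dimension 0 is marked relevant.
states = [[0.2, 1.0, -0.5], [0.8, -1.0, 0.3]]
mask = [1, 0, 0]

# A reward that reads only the relevant dimension is already invariant.
invariant_reward = lambda s: -abs(s[0])
# A reward that also reads dimension 2 leaks irrelevant information.
leaky_reward = lambda s: -abs(s[0]) + s[2]

loss_invariant = masking_loss(invariant_reward, states, mask)
loss_leaky = masking_loss(leaky_reward, states, mask)
```

A reward model that depends only on masked-in dimensions incurs zero penalty, while one that relies on masked-out dimensions is penalized in proportion to how strongly its output moves under the perturbation.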

Paper Structure

This paper contains 15 sections, 4 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Overview. Demonstrations show how to complete a task, but the same demonstration can be supported by many reward hypotheses. Language can be leveraged to disambiguate what matters in the environment. Even when both the demonstration (blue trajectory) and the associated instruction (e.g., “Stay away”) are individually ambiguous, when reasoning jointly about the pair they can often disambiguate each other, revealing the intended preference (“Stay away from the laptop”).
  • Figure 2: System Overview. We clarify ambiguous language instructions using demonstrations and LLM reasoning. We then map disambiguated instructions into state masks, which guide the reward model through a masking loss that enforces invariance to irrelevant state dimensions during training. We train the reward model with the weighted sum of the masking loss and the IRL loss. Using the learned reward model, we can perform trajectory optimization by selecting the trajectory with the highest reward.
  • Figure 3: Performance Across Reward Densities. The average win rate across all methods for different reward densities after (a) pretraining on 40 train preferences for 1k epochs and (b) fine-tuning on 30 test preferences for 100 epochs. All models are trained with 10 demonstrations per user preference and evaluated on unseen trajectories with novel object configurations. The shaded region indicates standard error across five different seeds.
  • Figure 4: Performance on ambiguous language. AI and DI denote models trained with ambiguous and disambiguated instructions, respectively. While LC-RL naively uses language only to condition the reward model, both Explicit Mask and Masked IRL significantly outperform LC-RL on train preferences, showing the benefit of using language to mask out irrelevant state dimensions. On test preferences, both language disambiguation and masking are important: Masked IRL using disambiguated instructions shows the highest performance.
  • Figure 5: Zero-shot Performance on Test Preferences with Real Robot. Masked IRL achieves higher win rates, lower reward variance given perturbation on irrelevant state dimensions, and lower win rates on optimized trajectories than baselines, showing its effectiveness in transferring to novel preferences without additional training.
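The Figure 2 caption describes two steps that are easy to sketch: training on a weighted sum of the IRL loss and the masking loss, and then choosing the candidate trajectory with the highest learned reward. The snippet below is a minimal illustration under stated assumptions: the weight is placed on the masking term (the caption does not say which term is weighted), a trajectory's reward is taken as the sum of per-state rewards, and the names `total_loss`, `trajectory_reward`, `select_best_trajectory`, and `mask_weight` are hypothetical.

```python
def total_loss(irl_loss, mask_loss, mask_weight=1.0):
    """Weighted sum of the two training losses.
    Placing the weight on the masking term is an assumption."""
    return irl_loss + mask_weight * mask_loss


def trajectory_reward(reward_fn, trajectory):
    """Assumed aggregation: sum of per-state rewards."""
    return sum(reward_fn(state) for state in trajectory)


def select_best_trajectory(reward_fn, candidates):
    """Return the candidate trajectory with the highest total reward."""
    return max(candidates, key=lambda t: trajectory_reward(reward_fn, t))


# Toy stand-in for a learned reward: prefer states near the origin.
toy_reward = lambda state: -abs(state[0])

candidates = [
    [[1.0], [2.0]],   # total reward -3.0
    [[0.0], [0.5]],   # total reward -0.5
]
best = select_best_trajectory(toy_reward, candidates)
combined = total_loss(irl_loss=1.0, mask_loss=2.0, mask_weight=0.5)
```

Selecting among a fixed candidate set is the simplest reading of "selecting the trajectory with the highest reward"; a gradient-based or sampling-based optimizer could fill the same role.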