LaMOuR: Leveraging Language Models for Out-of-Distribution Recovery in Reinforcement Learning

Chan Kim, Seung-Woo Seo, Seong-Woo Kim

TL;DR

LaMOuR tackles out-of-distribution (OOD) recovery in reinforcement learning by using large vision-language models (LVLMs) to describe OOD states, infer recovery behavior, and generate dense reward code that guides retraining back to valid states, all without relying on uncertainty estimation. It also adds language-model-guided policy consolidation so that the agent does not forget its original task once recovery succeeds. Empirically, LaMOuR improves recovery efficiency and generalizes to complex environments such as humanoid locomotion and mobile manipulation, outperforming uncertainty-based and several reward-generation baselines in many scenarios. The approach offers a scalable pathway for robust recovery in high-dimensional RL applications, with potential future work in automatic OOD detection and anomaly-triggered retraining.

Abstract

Deep Reinforcement Learning (DRL) has demonstrated strong performance in robotic control but remains susceptible to out-of-distribution (OOD) states, often resulting in unreliable actions and task failure. While previous methods have focused on minimizing or preventing OOD occurrences, they largely neglect recovery once an agent encounters such states. Although recent research has attempted to address this by guiding agents back to in-distribution states, its reliance on uncertainty estimation hinders scalability in complex environments. To overcome this limitation, we introduce Language Models for Out-of-Distribution Recovery (LaMOuR), which enables recovery learning without relying on uncertainty estimation. LaMOuR generates dense reward code that guides the agent back to a state where it can successfully perform its original task, leveraging the capabilities of large vision-language models (LVLMs) in image description, logical reasoning, and code generation. Experimental results show that LaMOuR substantially enhances recovery efficiency across diverse locomotion tasks and even generalizes effectively to complex environments, including humanoid locomotion and mobile manipulation, where existing methods struggle. The code and supplementary materials are available at https://lamour-rl.github.io/.
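To make "dense reward code" concrete, the following is a minimal sketch of the kind of reward and evaluation functions an LVLM might produce for an agent that has flipped over. It is not the paper's generated code: the `AgentState` fields, the thresholds, and the function names are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class AgentState:
    # Hypothetical state fields; real generated code would read the
    # environment's own observation layout instead.
    torso_height: float  # metres above the ground
    torso_pitch: float   # radians; 0 is upright, pi is flipped over


# Assumed values, picked only for this illustration.
TARGET_HEIGHT = 0.6  # nominal standing height
HEIGHT_TOL = 0.1     # height tolerance for counting as recovered
PITCH_TOL = 0.3      # pitch tolerance in radians


def recovery_reward(state: AgentState) -> float:
    """Dense reward: rises smoothly as the agent nears an upright pose."""
    upright_term = math.cos(state.torso_pitch)  # 1 upright, -1 flipped
    height_term = -abs(state.torso_height - TARGET_HEIGHT)
    return upright_term + height_term


def is_recovered(state: AgentState) -> bool:
    """Evaluation check: True once the agent is back in a valid state."""
    return (
        abs(state.torso_pitch) < PITCH_TOL
        and abs(state.torso_height - TARGET_HEIGHT) < HEIGHT_TOL
    )
```

Because the reward changes at every step of the recovery motion, rather than only on success, the agent receives a learning signal even while it is still far from a valid state.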

Paper Structure

This paper contains 26 sections, 8 equations, 16 figures, 4 tables, 1 algorithm.

Figures (16)

  • Figure 1: LaMOuR generates dense reward code to guide the agent’s recovery from an OOD state. Retraining with this reward enables the agent to return to states where it can successfully perform the original task.
  • Figure 2: Overview of recovery reward generation in LaMOuR: First, the LVLM generates a description of the OOD state based on the given OOD state image. Next, it infers the recovery behavior needed for the agent to return to a valid state. Finally, the LVLM generates reward and evaluation codes that align with the inferred recovery behavior.
  • Figure 3: OOD states in four modified MuJoCo and Complex environments.
  • Figure 4: Qualitative results of the retrained agent using LaMOuR in the PushChair environment (top) and the Humanoid environment (bottom).
  • Figure 5: Learning curves for the four MuJoCo environments during retraining, evaluated over five episodes every 5000 training steps. Darker lines and shaded areas denote the mean returns and standard deviations across five random seeds, respectively. The black dashed line represents the average return of the original policy on the original task from a valid state. It is calculated as the average return of the original policy over 100 episodes.
  • ...and 11 more figures
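The three-stage flow described in the Figure 2 caption (describe the OOD state, infer the recovery behavior, generate reward and evaluation code) can be sketched as a chain of prompts. This is an illustrative outline only: the `LVLM` callable type, the prompt wording, and `fake_lvlm` are hypothetical stand-ins, and the image is passed as a path inside the prompt for simplicity, whereas a real LVLM client would receive the image itself.

```python
from typing import Callable, Dict

# Hypothetical LVLM interface: takes a prompt and returns text.
# A real client would need to be wrapped to fit this shape.
LVLM = Callable[[str], str]


def generate_recovery_code(lvlm: LVLM, image_path: str, task: str) -> Dict[str, str]:
    # Stage 1: describe the OOD state shown in the image.
    description = lvlm(f"Describe the agent's state in the image at {image_path}.")
    # Stage 2: infer the behavior that returns the agent to a valid state.
    behavior = lvlm(
        f"Original task: {task}\nState: {description}\n"
        "What behavior returns the agent to a state where it can do the task?"
    )
    # Stage 3: generate reward and evaluation code for that behavior.
    code = lvlm(
        f"Recovery behavior: {behavior}\n"
        "Write Python reward and evaluation functions for this behavior."
    )
    return {"description": description, "behavior": behavior, "code": code}


def fake_lvlm(prompt: str) -> str:
    """Canned responses so the sketch runs without any model access."""
    if prompt.startswith("Describe"):
        return "The agent is lying on its back."
    if prompt.startswith("Original task"):
        return "Flip over and stand upright."
    return "def reward(state): ...\ndef evaluate(state): ..."


result = generate_recovery_code(fake_lvlm, "ood_state.png", "run forward")
```

Each stage feeds its output into the next prompt, so the generated code is conditioned on the inferred behavior rather than on the raw image alone.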