When to ASK: Uncertainty-Gated Language Assistance for Reinforcement Learning

Juarez Monteiro, Nathan Gavenski, Gianlucca Zuin, Adriano Veloso

Abstract

Reinforcement learning (RL) agents often struggle with out-of-distribution (OOD) scenarios, leading to high uncertainty and random behavior. While language models (LMs) contain valuable world knowledge, larger ones incur high computational costs, hindering real-time use, and exhibit limitations in autonomous planning. We introduce Adaptive Safety through Knowledge (ASK), which combines smaller LMs with trained RL policies to enhance OOD generalization without retraining. ASK employs Monte Carlo Dropout to assess uncertainty and queries the LM for action suggestions only when uncertainty exceeds a set threshold. This selective use preserves the efficiency of existing policies while leveraging the language model's reasoning in uncertain situations. In experiments on the FrozenLake environment, ASK shows no improvement in-domain, but demonstrates robust navigation in transfer tasks, achieving a reward of 0.95. Our findings indicate that effective neuro-symbolic integration requires careful orchestration rather than simple combination, highlighting the need for sufficient model scale and effective hybridization mechanisms for successful OOD generalization.
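The uncertainty-gated selection described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy policy (with dropout noise simulated by logit perturbation), the LM stub, the entropy-based uncertainty measure, the threshold value, and all function names are hypothetical.

```python
import numpy as np

def mc_dropout_uncertainty(policy, state, n_samples=20, rng=None):
    """Estimate epistemic uncertainty in the spirit of Monte Carlo Dropout:
    run several stochastic forward passes and measure the entropy of the
    averaged action distribution."""
    if rng is None:
        rng = np.random.default_rng(0)
    probs = np.stack([policy(state, rng) for _ in range(n_samples)])
    mean = probs.mean(axis=0)
    entropy = -np.sum(mean * np.log(mean + 1e-12))
    return mean, entropy

def ask_select_action(policy, lm_suggest, state, threshold):
    """Uncertainty-gated selection: follow the RL policy when it is
    confident; otherwise defer to the language model's suggestion."""
    mean_probs, entropy = mc_dropout_uncertainty(policy, state)
    if entropy > threshold:
        return lm_suggest(state)       # high uncertainty: query the LM
    return int(np.argmax(mean_probs))  # low uncertainty: trust the policy

# Toy stand-ins (hypothetical): a confident 4-action policy whose dropout
# noise is simulated by perturbing fixed logits, and an LM stub that
# always suggests action 2.
def toy_policy(state, rng):
    logits = np.array([3.0, 0.5, 0.1, 0.1]) + rng.normal(0.0, 0.1, 4)
    e = np.exp(logits - logits.max())
    return e / e.sum()

action = ask_select_action(toy_policy, lambda s: 2, state=0, threshold=1.0)
```

With a confident policy the averaged distribution has low entropy, so the gate keeps the policy's greedy action; lowering the threshold (or facing an OOD state that flattens the distribution) routes the decision to the LM instead, which is what preserves the base policy's efficiency in-distribution.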

Paper Structure

This paper contains 16 sections, 2 equations, 3 figures, 4 tables, 1 algorithm.

Figures (3)

  • Figure 1: Different paths for the same context, where purple is the LM, orange the PPO, and green the ASK approach.
  • Figure 2: In-domain performance of PPO, LM, and ASK on the FrozenLake environment across grid sizes for language model sizes ranging from 0.5B to 72B parameters. Values above markers denote mean reward over 100 episodes.
  • Figure 3: Schematic and image representations for the FrozenLake-v$1$ environment. In Figs. \ref{['fig:sub:schem4']} and \ref{['fig:sub:schem8']}, each cell corresponds to a Markovian state, and special tiles denote the start (S), goal (G), and terminal-failure states (H). These structured state descriptions are encoded as contextual information and incorporated into prompts, supporting decision-making in the PPO policy through gated interventions.