SAC-GLAM: Improving Online RL for LLM agents with Soft Actor-Critic and Hindsight Relabeling

Loris Gaven, Clement Romac, Thomas Carta, Sylvain Lamprier, Olivier Sigaud, Pierre-Yves Oudeyer

TL;DR

This work tackles the challenge of enabling autotelic, goal-driven learning for LLM-based agents by moving beyond on-policy RL. It introduces SAC-GLAM, an off-policy Soft Actor-Critic framework in which an encoder-decoder LLM acts as the stochastic policy over discrete, environment-level actions (token sequences) and is paired with an MLP critic; the policy's action probabilities are derived from token log-probabilities, and learning is further improved with Hindsight Experience Replay (HER). Empirical results in Playground-Text show that SAC-GLAM achieves better sample efficiency than PPO-GLAM, with further gains when HER is applied, and ablations highlight the architectural and hyperparameter choices that stabilize training. Overall, SAC-GLAM demonstrates the viability of off-policy RL with LLMs and HER, contributing a path toward autonomous autotelic LLM agents with practical improvements in sample and time efficiency.

Abstract

The past years have seen Large Language Models (LLMs) thrive not only as generative models but also as agents solving textual sequential decision-making tasks. When facing complex environments where their zero-shot abilities are insufficient, recent work showed that online Reinforcement Learning (RL) could be used for the LLM agent to discover and learn efficient strategies interactively. However, most prior work sticks to on-policy algorithms, which greatly reduces the scope of methods such agents could use for both exploration and exploitation, such as experience replay and hindsight relabeling. Yet, such methods may be key for LLM learning agents, in particular when designing autonomous, intrinsically motivated agents that sample and pursue their own goals (i.e., autotelic agents). This paper presents and studies an adaptation of Soft Actor-Critic and hindsight relabeling to LLM agents. Our method not only paves the way towards autotelic LLM agents that learn online but can also outperform on-policy methods in more classic multi-goal RL environments.
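
To make the idea of hindsight relabeling with a replay buffer concrete, here is a minimal sketch, not the authors' implementation. It assumes a sparse reward of 1 on the final step when the goal is reached and 0 otherwise, and that some relabeler supplies a description of the goal actually achieved. The function name `store_with_hindsight`, its arguments, and the example goals and actions are all hypothetical.

```python
from collections import deque


def store_with_hindsight(buffer, goal, steps, success, hindsight_goal=None):
    """Store a trajectory, plus a relabeled copy when a hindsight goal exists.

    steps: list of (observation, action, next_observation) tuples.
    success: whether the original goal was reached on the last step.
    hindsight_goal: goal actually achieved, as given by a relabeler
    (hypothetical argument; None if nothing useful was achieved).
    """
    def push(g, reached):
        last = len(steps) - 1
        for t, (obs, act, nxt) in enumerate(steps):
            done = t == last
            # Assumed sparse reward: 1 only on the final step of a success.
            reward = 1.0 if (done and reached) else 0.0
            buffer.append((g, obs, act, reward, nxt, done))

    # Original trajectory, labeled with the goal the agent was pursuing.
    push(goal, success)
    # Relabeled copy: the same steps count as a success for the achieved goal.
    if hindsight_goal is not None and hindsight_goal != goal:
        push(hindsight_goal, True)


# Made-up two-step trajectory that fails its goal but achieves another one.
buffer = deque(maxlen=10_000)
steps = [("obs0", "go to cat", "obs1"), ("obs1", "grasp", "obs2")]
store_with_hindsight(buffer, "grasp dog", steps, success=False,
                     hindsight_goal="grasp cat")
```

Because an off-policy learner can train on any stored transition, the relabeled copy turns a failed episode into a rewarded one for a different goal, which an on-policy method cannot reuse in the same way.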

Paper Structure

This paper contains 15 sections, 3 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: The SAC-GLAM method. (a) depicts the agent's architecture when an encoder-decoder LLM is used: the actor computes each action's probability as the probability the LLM assigns to the action's tokens following the prompt $p$, which concatenates the observation and the goal, while the critic computes the Q-value of each action $a$ given the prompt $p$ with an MLP attached to the decoder's last hidden state. (b) illustrates the agent-environment interaction, where trajectories are generated and added to the replay buffer. We used an environment where a social partner relabels these trajectories with hindsight goals.
  • Figure 2: Performance comparison of SAC-GLAM and PPO-GLAM in the Playground-Text environment. We show the average success rate as a function of the number of steps (left) and the average success rate over time in seconds (right). The mean and standard deviation are calculated across 4 seeds.
  • Figure 3: An observation in the Playground-Text environment. All necessary information is provided in the observation, making the environment fully observable.
  • Figure 4: Proportion of each goal type. The distribution of goals is highly imbalanced, with sequential goals making up $98\%$ of the total.
  • Figure 5: Comparison of Critic Architectures. The blue curve represents the architecture where both the observation and action are inputs, producing a single Q-value. The orange curve corresponds to the architecture where only the observation is input, producing Q-values for each action. The left plot illustrates the case where gradients are backpropagated only through the MLP head, while the right plot shows the case where gradients propagate through both the MLP and shared parameters. Mean and standard deviation are computed across two seeds.
  • ...and 4 more figures
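
The actor described in Figure 1(a) can be illustrated with a short sketch. It assumes the per-token log-probabilities of each candidate action are already available (in practice they would come from the LLM's decoder). Normalising over the finite action set with a softmax is an assumption here, not something the caption states. The function name `action_distribution` and the example actions and values are hypothetical.

```python
import math


def action_distribution(token_logprobs):
    """Map per-token log-probabilities to a distribution over actions.

    token_logprobs: dict from action string to the list of log-probabilities
    of its tokens given the prompt (hypothetical input format).
    """
    # Log-probability of a token sequence is the sum of its token log-probs.
    seq_logprob = {action: sum(lps) for action, lps in token_logprobs.items()}
    # Numerically stable softmax over the candidate actions (assumed step).
    top = max(seq_logprob.values())
    weights = {action: math.exp(lp - top) for action, lp in seq_logprob.items()}
    total = sum(weights.values())
    return {action: w / total for action, w in weights.items()}


# Made-up log-probabilities for two candidate actions.
example = {"go north": [-0.1, -0.2], "grasp": [-1.0]}
probs = action_distribution(example)
```

Summing token log-probabilities lets multi-token actions be scored as single environment-level actions, so the policy remains a distribution over a discrete action set.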