SAC-GLAM: Improving Online RL for LLM agents with Soft Actor-Critic and Hindsight Relabeling
Loris Gaven, Clement Romac, Thomas Carta, Sylvain Lamprier, Olivier Sigaud, Pierre-Yves Oudeyer
TL;DR
This work tackles the challenge of enabling autotelic, goal-driven learning for LLM-based agents by moving beyond on-policy RL. It introduces SAC-GLAM, an off-policy Soft Actor-Critic framework in which an encoder-decoder LLM acts as the stochastic policy over discrete, environment-level actions (each a token sequence) and is paired with an MLP critic. Action probabilities are derived from the LLM's token log-probabilities, and learning is further improved with hindsight relabeling (HER). Empirical results in Playground-Text show that SAC-GLAM is more sample-efficient than PPO-GLAM, with further gains when HER is applied, and ablations highlight the architectural and hyperparameter choices that stabilize training. Overall, SAC-GLAM demonstrates the viability of off-policy RL with LLMs and HER, and offers a path toward autonomous autotelic LLM agents with practical improvements in sample and time efficiency.
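As an illustration of how an LLM can act as a stochastic policy over environment-level actions, the sketch below sums the token log-probabilities of each candidate action and normalizes across actions with a softmax. This is a minimal sketch under assumptions: the function names, the input format, and the softmax normalization are hypothetical choices made for illustration, and the numbers are made up rather than produced by a model. The entropy helper is included because Soft Actor-Critic objectives include a policy entropy term.

```python
import math


def action_distribution(token_logprobs_per_action):
    """Turn per-token log-probabilities into a distribution over actions.

    token_logprobs_per_action maps each action string to the list of
    log-probabilities assigned to its tokens (hypothetical input; in
    practice these would come from a forward pass of the LLM).
    """
    # Log-probability of a whole action = sum of its token log-probabilities.
    scores = {a: sum(lps) for a, lps in token_logprobs_per_action.items()}
    # Softmax over actions (assumed normalization), with max-subtraction
    # for numerical stability.
    m = max(scores.values())
    exp_scores = {a: math.exp(s - m) for a, s in scores.items()}
    z = sum(exp_scores.values())
    return {a: e / z for a, e in exp_scores.items()}


def entropy(dist):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


# Made-up token log-probabilities for two candidate actions.
example = {"go north": [-0.2, -0.5], "grasp cat": [-1.0, -1.2, -0.3]}
policy = action_distribution(example)
print(policy, entropy(policy))
```

Because the policy is an explicit distribution over a finite action set, quantities such as its entropy can be computed exactly by enumerating actions instead of being estimated from samples.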
Abstract
The past years have seen Large Language Models (LLMs) thrive not only as generative models but also as agents solving textual sequential decision-making tasks. When facing complex environments where their zero-shot abilities are insufficient, recent work has shown that online Reinforcement Learning (RL) can be used for the LLM agent to discover and learn efficient strategies interactively. However, most prior work relies on on-policy algorithms, which greatly restricts the range of methods such agents can use for both exploration and exploitation, excluding techniques such as experience replay and hindsight relabeling. Yet such methods may be key for LLM learning agents, in particular when designing autonomous, intrinsically motivated agents that sample and pursue their own goals (i.e., autotelic agents). This paper presents and studies an adaptation of Soft Actor-Critic and hindsight relabeling to LLM agents. Our method not only paves the way towards autotelic LLM agents that learn online but can also outperform on-policy methods in more classic multi-goal RL environments.
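To make the idea of hindsight relabeling concrete, here is a minimal sketch of relabeling a stored trajectory with a goal the agent actually achieved, which is only possible when past experience is replayed off-policy. Everything in it is an assumption made for illustration: the `Transition` fields, the `relabel_with_hindsight` helper, the sparse reward of 1.0 on the final step, and the example strings are hypothetical and not taken from the paper's implementation.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Transition:
    observation: str
    goal: str
    action: str
    reward: float
    next_observation: str
    done: bool


def relabel_with_hindsight(trajectory, achieved_goal):
    """Return a copy of the trajectory as if achieved_goal had been the target.

    Assumes a sparse reward: 1.0 on the final step (where the substituted
    goal is reached) and 0.0 elsewhere. The original list is not modified.
    """
    last = len(trajectory) - 1
    return [
        replace(
            t,
            goal=achieved_goal,
            reward=1.0 if i == last else 0.0,
            done=(i == last),
        )
        for i, t in enumerate(trajectory)
    ]


# A failed episode for the goal "grow a plant" that ended up grasping the cat.
failed = [
    Transition("you see a cat", "grow a plant", "go to cat", 0.0,
               "you are next to the cat", False),
    Transition("you are next to the cat", "grow a plant", "grasp cat", 0.0,
               "you hold the cat", True),
]
replay_buffer = failed + relabel_with_hindsight(failed, "grasp the cat")
```

Storing both the original and the relabeled transitions turns an unrewarded episode into a rewarded example for a different goal. An on-policy method cannot reuse such data directly, since the relabeled trajectory was not generated by the current policy conditioned on that goal.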
