Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement

Muning Wen; Junwei Liao; Cheng Deng; Jun Wang; Weinan Zhang; Ying Wen

TL;DR

The paper addresses the misalignment between reinforcement learning objectives and language modeling when tuning language agents for interactive tasks. It introduces Entropy-Regularized Token-Level Policy Optimization (ETPO), which decomposes action-level optimization into per-token updates using a per-token soft Bellman backup under an entropy-regularized objective. The authors prove that optimizing the token-level objective is consistent with optimizing the full-action objective, and show that the decomposition reduces the cost of action exploration from exponential to linear. On a data-science code-generation task with CodeLlama-7B, ETPO achieves stable performance gains over action-level baselines with minimal impact on perplexity, offering a practical path to improving the interactive decision-making capabilities of language agents. The work also discusses limitations and directions for future work, such as self-reward signals and hindsight relabeling.

Abstract

Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks. Traditional approaches often depend on meticulously designed prompts, high-quality examples, or additional reward models for in-context learning, supervised fine-tuning, or RLHF. Reinforcement learning (RL) presents a dynamic alternative for LLMs to overcome these dependencies by engaging directly with task-specific environments. Nonetheless, it faces significant hurdles: 1) instability stemming from the exponentially vast action space requiring exploration; 2) challenges in assigning token-level credit based on action-level reward signals, resulting in discord between maximizing rewards and accurately modeling corpus data. In response to these challenges, we introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level. At the heart of ETPO is our novel per-token soft Bellman update, designed to harmonize the RL process with the principles of language modeling. This methodology decomposes the Q-function update from a coarse action-level view to a more granular token-level perspective, backed by theoretical proof of optimization consistency. Crucially, this decomposition renders linear time complexity in action exploration. We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks; results underline ETPO's potential as a robust method for refining the interactive decision-making capabilities of language agents. For a more detailed preliminary work describing our motivation for token-level decomposition and applying it in PPO methods, please refer to arXiv:2405.15821.
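The per-token soft Bellman update mentioned in the abstract can be illustrated with a minimal sketch. It follows the two target forms quoted in the Figure 2 caption: $\mathbb{E}[\cdot]-\beta D_{KL}$ for a token inside an action and $r+\gamma(\mathbb{E}[\cdot]-\beta D_{KL})$ for the last token of an action. The function names, the list-based distributions, and the choice of taking the KL term from the current policy to a reference policy are assumptions made for illustration, not the authors' implementation.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def soft_backup_target(next_q, policy, reference, beta, reward=None, gamma=1.0):
    """Hypothetical helper: compute one per-token soft Bellman target.

    next_q    -- Q-value estimates for each candidate next token
    policy    -- current policy probabilities over those tokens
    reference -- reference policy probabilities (assumed KL anchor)
    beta      -- weight of the KL regularizer
    reward    -- None for a token inside an action; the action-level
                 reward when this is the last token of the action
    gamma     -- discount factor, applied only across environment steps
    """
    expected_q = sum(p * q for p, q in zip(policy, next_q))
    soft_value = expected_q - beta * kl_divergence(policy, reference)
    if reward is None:
        # Intra-action case: E[.] - beta * D_KL, no reward, no discount.
        return soft_value
    # End-of-action case: r + gamma * (E[.] - beta * D_KL).
    return reward + gamma * soft_value

# Toy example with a two-token vocabulary.
inner_target = soft_backup_target([1.0, 3.0], [0.5, 0.5], [0.5, 0.5], beta=0.1)
final_target = soft_backup_target([1.0, 3.0], [0.5, 0.5], [0.5, 0.5],
                                  beta=0.1, reward=1.0, gamma=0.5)
```

In this sketch the reward and the discount enter only at the final token, so credit reaches earlier tokens of the same action through the chain of intra-action backups.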

Paper Structure (26 sections, 17 equations, 10 figures, 5 tables, 1 algorithm)

Introduction
Related Works
LLMs for Interactive Tasks
Entropy-Regularized Reinforcement Learning
Preliminaries
Language-Augmented Sequential Decision-Making
Entropy-Regularized Reinforcement Learning
Action-Level and Token-Level Policy Optimization
Entropy-Regularized Token-Level Policy Optimization
Per-Token Soft Bellman Updates
Per-Token Policy Update
A Practical Algorithm
Experiments
Environmental Setup
Main Results
...and 11 more sections

Figures (10)

Figure 1: Exemplary run of ETPO on the Balance Scale dataset. This figure illustrates the evolution of the code generated by CodeLlama-7B during the experiment, where the environment step counts the number of interactions. The changes that improved performance in each iteration, e.g. finding better models or hyper-parameters, are highlighted. The run suggests that ETPO can guide models toward more complex yet more effective behaviors.
Figure 2: The overall pipeline of ETPO. In each time step, the LLM agent receives a state from the interactive environment. It then generates an action token by token until the action is complete, i.e. $j=|a|$, and the action is executed. The action is then split into a sequence of tokens whose Q-values are updated following the per-token soft Bellman update scheme, where $\mathbb{E}[\cdot]-\beta D_{KL}$ and $r+\gamma(\mathbb{E}[\cdot]-\beta D_{KL})$ correspond to the two cases in Equation \ref{equ_per_token_soft_bellman}. The LLM policy is also updated to minimize the KL divergence between it and the soft Q-network.
Figure 3: Average of the best reward explored across 14 datasets corresponding to different environmental steps $k$. "REF" indicates the Reflection baselines.
Figure 4: Average performance during training, compared across all of the lesser-known Kaggle datasets.
Figure 5: Detailed pipeline of applying ETPO to the data science code generation task, where the yellow curve demonstrates the sampling process and the red curve indicates the training loop. The dotted line means the soft Q network is initialized with the policy network, i.e. the LLM, at the beginning. The code generated by LLM agents will be used to replace the "FILL_ME" component in the states.
...and 5 more figures
