Reinforcing Language Agents via Policy Optimization with Action Decomposition

Muning Wen; Ziyu Wan; Weinan Zhang; Jun Wang; Ying Wen

TL;DR

The paper addresses inefficient action-level credit assignment in language-augmented RL by introducing the Bellman backup with Action Decomposition (BAD), which assigns credit to individual intra-action tokens. Integrating BAD into PPO yields Policy Optimization with Action Decomposition (POAD), which provides fine-grained token-level supervision while remaining consistent with the original action-level RL objective. Experiments in both restricted and unrestricted action spaces show improved learning speed, stability, and generalization, with strong performance on open-vocabulary tasks and minimal impact on language abilities. The work offers a scalable framework for language agents in complex environments and a theoretical analysis of the discrepancy between token-level and action-level optimization.

Abstract

Language models as intelligent agents push the boundaries of sequential decision-making but struggle with limited knowledge of environmental dynamics and exponentially large action spaces. Recent efforts like GLAM and TWOSOME manually constrain the action space to a restricted subset and employ reinforcement learning to align agents' knowledge with specific environments. However, they overlook fine-grained credit assignment for intra-action tokens, which is essential for efficient language agent optimization, and rely on human prior knowledge to restrict the action space. This paper proposes decomposing language agent optimization from the action level to the token level, offering finer supervision for each intra-action token and manageable optimization complexity in environments with unrestricted action spaces. Beginning with the simplification of flattening all actions, we theoretically explore the discrepancies between action-level optimization and this naive token-level optimization. We then derive the Bellman backup with Action Decomposition (BAD) to integrate credit assignments for both intra-action and inter-action tokens, effectively eliminating the discrepancies. Implementing BAD within the PPO algorithm, we introduce Policy Optimization with Action Decomposition (POAD). POAD benefits from a finer-grained credit assignment process and lower optimization complexity, leading to enhanced learning efficiency and generalization abilities in aligning language agents with interactive environments. We validate POAD across diverse testbeds, with results affirming the advantages of our approach and the correctness of our theoretical analysis.
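The gap between action-level and naive token-level optimization can be illustrated with a toy calculation. The following is a minimal sketch, not the paper's implementation. It assumes, purely for illustration, that flattening an action into tokens applies a per-token discount `gamma_w` at every token step, so a reward that arrives when the action completes is discounted once for each remaining token. The function name and the example values are hypothetical.

```python
def end_of_action_discounts(gamma_w: float, num_tokens: int) -> list[float]:
    """Discount applied to an end-of-action reward, as seen from each token.

    Assumption (illustrative only): every token step is discounted by
    gamma_w, and the reward arrives with the last token of the action.
    Token j (0-indexed) is therefore num_tokens - 1 - j steps away from it.
    """
    if num_tokens < 1:
        raise ValueError("an action must contain at least one token")
    return [gamma_w ** (num_tokens - 1 - j) for j in range(num_tokens)]


if __name__ == "__main__":
    # With gamma_w = 1.0 every token sees the full reward; as gamma_w
    # shrinks, early tokens of long actions see less and less of it.
    for gamma_w in (1.0, 0.9, 0.5):
        print(gamma_w, end_of_action_discounts(gamma_w, num_tokens=3))
```

Under this assumption the effective discount depends on how many tokens an action happens to contain, which is one way a flattened token-level objective can drift from the action-level one.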

Paper Structure (38 sections, 20 equations, 12 figures, 17 tables, 1 algorithm)

Introduction
Related Works
Preliminaries
Language-augmented RL
Action-level Policy Optimization
From Actions to Tokens: Naive Token-level Policy Optimization
Naive Token-level Policy Optimization
The Discrepancy
Action-Decomposition Reinforcement Learning
Bellman Backup with Action-Decomposition
Policy Optimization with Action Decomposition
Experiments
Environmental Setup
Baseline Methods
Main Results
...and 23 more sections

Figures (12)

Figure 1: A case demonstrating (a) the necessity of aligning language agents with environments to exclude the wrong option, since the agent does not initially know that "coffee table is empty", and (b) that under action-level optimization it is uncertain to what extent the probability of key tokens, e.g. $\mathbb{P}(\text{"kitchen"}|p,\text{"Walk to"})$, will be enhanced when optimizing the joint probability $\mathbb{P}(\text{"Walk to kitchen"}|p)$.
Figure 2: Visual comparison of the action-level Bellman backup (left) and our BAD (right), given the goal "turn on the TV". Here $q$ denotes the action or token value estimates, and $\delta_{t}=q_{t}-q_{t-1}$ and $\delta_t^{j}=q_t^{j}-q_t^{j-1}$ represent the credit assigned to the corresponding actions and tokens, respectively, for the policy update, e.g. the advantage value (Babaeizadeh et al., 2016). To facilitate understanding, a step-by-step breakdown of the right figure is provided in the appendix.
Figure 3: Performance comparisons on Overcooked (first two) and VirtualHome (last two).
Figure 4: Because TWOSOME does not support open action space tasks, we compare the average performance of POAD and TPPO on the DataSciCoding benchmarks. We also report POAD-Best, the performance of the best code explored by POAD during the training phase, and CAAFE with GPT-4.
Figure 5: Ablation on $\gamma_a\in\{1.0,0.95\}$ for both TWOSOME and POAD (left), and $\gamma_w\in\{0.95,0.9,0.8,0.5\}$ for NTPO while keeping $\gamma_a=0.95$ unchanged (right). In the left figure, setting $\gamma_a=1.0$ led to decreased performance and convergence for TWOSOME and POAD, validating the necessity of $\gamma_a<1.0$. In the right figure, the performance gap between POAD and NTPO widens as $\gamma_w$ decreases, which verifies the theoretical analysis in the section "The Discrepancy".
...and 7 more figures
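The credit definitions in the Figure 2 caption, $\delta_{t}=q_{t}-q_{t-1}$ and $\delta_t^{j}=q_t^{j}-q_t^{j-1}$, can be written out as a short sketch. This only restates those two differences and is not the paper's BAD update. The function names, the toy values, and the `baseline` argument (standing in for $q_t^{0}$, which the caption does not define) are assumptions made for illustration.

```python
def action_credits(q_actions):
    """delta_t = q_t - q_{t-1} for consecutive action-value estimates."""
    return [q_actions[t] - q_actions[t - 1] for t in range(1, len(q_actions))]


def token_credits(q_tokens, baseline):
    """delta_t^j = q_t^j - q_t^{j-1} for the tokens of one action.

    `baseline` plays the role of q_t^0, the value estimate before the
    first token of the action (a hypothetical choice for this sketch).
    """
    values = [baseline] + list(q_tokens)
    return [values[j] - values[j - 1] for j in range(1, len(values))]


if __name__ == "__main__":
    # Toy value estimates for a three-token action.
    credits = token_credits([2, 5, 4], baseline=1)
    print(credits)
    # The per-token credits telescope to the overall change in value.
    print(sum(credits) == 4 - 1)
```

Because the differences telescope, the per-token credits within an action sum to the change between the value before the first token and the value after the last one, so splitting credit across tokens does not alter the total assigned to the action in this sketch.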
