Adapting LLM Agents with Universal Feedback in Communication

Kuan Wang, Yadong Lu, Michael Santacroce, Yeyun Gong, Chao Zhang, Yelong Shen

TL;DR

The paper introduces Learning Through Communication (LTC), a universal framework for training LLM agents with both linguistic feedback and non-linguistic rewards. It builds a universal replay buffer that stores trajectories as token sequences with source masks and rewards, and implements an iterative two-phase pipeline: an Exploration phase that collects diverse data across single- and multi-agent settings, followed by an Updating phase that optimizes a joint objective blending language modeling with PPO-based reinforcement learning. LTC supports three task-specific communication patterns (Single-agent Monologue, Multi-agent Dialogue, and Teacher-student Dialogue) that enable flexible, scalable learning across ALFWorld, HotpotQA, Chameleon, and GSM8k. Empirically, LTC outperforms instruction-tuning baselines by 3.6–12% across these datasets, demonstrating improved adaptability and efficiency, though its gains over larger-model baselines on numerical reasoning may depend on model scale. This work highlights a practical, low-supervision approach to online adaptation of LLM agents through structured communication and feedback.

Abstract

Recent advances in large language models (LLMs) have demonstrated potential for LLM agents. To facilitate the training of these agents with both linguistic feedback and non-linguistic reward signals, we introduce Learning through Communication (LTC). We design a universal buffer to store all the feedback, and an iterative pipeline to enable an LLM agent to explore and update its policy in a given environment. To optimize agent interactions for task-specific learning with our universal buffer and pipeline, we introduce diverse communication patterns tailored for both single-agent and multi-agent environments. We evaluate the efficacy of our LTC approach on four diverse datasets: ALFWorld (single-agent), HotpotQA (multi-agent collaboration), Chameleon (multi-agent competition), and GSM8k (multi-agent teacher-student). On these datasets, LTC outperforms the supervised instruction fine-tuning baselines by 3.6% to 12%. These results highlight the versatility and efficiency of LTC in facilitating online adaptation for LLM agents.

Paper Structure (38 sections, 2 equations, 9 figures, 5 tables, 4 algorithms)

Figures (9)

  • Figure 1: The LTC framework is suited to both single-agent and multi-agent environments. Within these environments, agents can continually explore and interact to collect trajectories through various communication patterns. Concurrently, LTC trains these agents on the data acquired from their exploration. This process enables the agents to adapt autonomously to their respective environments, without the need for human supervision.
  • Figure 2: The buffer data is a series of integer/float sequences. We treat each token id as the action in our reinforcement learning formulation. We also save its corresponding mask indicating the source of the token, the value from the critic model, the log-prob indicating the log-likelihood when sampling the action, and the reward from the environment/other agents.
  • Figure 3: LTC has an iterative two-phase framework. During the exploration phase, the agent proactively explores new environments and communicates with other agents, gathering trajectories to update the replay buffer. During the updating phase, the agent is trained to update its policy.
  • Figure 4: Toy examples demonstrating the communication patterns: 1) the left figure shows the Multi-agent Dialogue pattern, where two agents play different roles to collaborate on the task. The thinker agent is responsible for analyzing the situation and giving suggestions to the actor agent, who is responsible for making decisions. When testing without a GPT-4 agent, we can simply assign the LTC agent to play the thinker agent. 2) the right figure shows the Teacher-student Dialogue pattern, where the student agent starts with an initial answer to the current question, and then the teacher directly corrects the answer with a reward. To help the student improve its ability rather than just memorize the solution, the teacher generates another analogous question to ask the student. Finally, the student gives a new answer to this analogous question and gets a new reward signal from the teacher.
  • Figure 5: The accuracy curves of training.
  • ...and 4 more figures
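
The buffer layout in the Figure 2 caption and the two-phase loop in the Figure 3 caption can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the class and function names (`Trajectory`, `ReplayBuffer`, `two_phase_loop`), the mask encoding (0 for tokens from the system or other agents, 1 for tokens sampled by the trained agent), and the `explore` / `update` callables are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List

# Hypothetical mask encoding; the caption only says the mask records
# the source of each token.
SOURCE_OTHER = 0   # prompt, environment, or other agents
SOURCE_AGENT = 1   # token sampled by the agent being trained


@dataclass
class Trajectory:
    """One buffer entry: parallel per-token sequences of equal length."""
    token_ids: List[int]      # each token id is treated as an RL action
    source_mask: List[int]    # who produced each token
    values: List[float]       # critic value estimate per token
    log_probs: List[float]    # log-likelihood of the token when sampled
    rewards: List[float]      # reward from the environment / other agents

    def __post_init__(self) -> None:
        lengths = {
            len(self.token_ids), len(self.source_mask),
            len(self.values), len(self.log_probs), len(self.rewards),
        }
        if len(lengths) != 1:
            raise ValueError("all per-token sequences must have the same length")

    def agent_positions(self) -> List[int]:
        """Indices of tokens the trained agent generated itself."""
        return [i for i, m in enumerate(self.source_mask) if m == SOURCE_AGENT]


@dataclass
class ReplayBuffer:
    trajectories: List[Trajectory] = field(default_factory=list)

    def add(self, trajectory: Trajectory) -> None:
        self.trajectories.append(trajectory)

    def __len__(self) -> int:
        return len(self.trajectories)


def two_phase_loop(
    buffer: ReplayBuffer,
    explore: Callable[[], Iterable[Trajectory]],
    update: Callable[[ReplayBuffer], None],
    iterations: int,
) -> None:
    """Alternate exploration (fill the buffer) and updating (train on it)."""
    for _ in range(iterations):
        for trajectory in explore():   # exploration phase
            buffer.add(trajectory)
        update(buffer)                 # updating phase


# Toy usage: two context tokens followed by two agent-generated tokens.
example = Trajectory(
    token_ids=[101, 7, 42, 9],
    source_mask=[SOURCE_OTHER, SOURCE_OTHER, SOURCE_AGENT, SOURCE_AGENT],
    values=[0.0, 0.0, 0.3, 0.5],
    log_probs=[0.0, 0.0, -1.2, -0.7],
    rewards=[0.0, 0.0, 0.0, 1.0],
)
```

Keeping the five sequences aligned by position means a training step can use the mask to select which tokens contribute to which loss term. How the paper actually weights those terms is not stated in the captions above, so that part is left out of the sketch.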