Adapting LLM Agents with Universal Feedback in Communication
Kuan Wang, Yadong Lu, Michael Santacroce, Yeyun Gong, Chao Zhang, Yelong Shen
TL;DR
The paper introduces Learning through Communication (LTC), a universal framework for training LLM agents with both linguistic feedback and non-linguistic rewards. It builds a universal replay buffer that stores trajectories as token sequences with source masks and rewards, and implements an iterative two-phase pipeline: an Exploration phase collects diverse data across single- and multi-agent settings, and an Updating phase optimizes a joint objective that blends language modeling with PPO-based reinforcement learning. LTC supports three task-specific communication patterns (Single-agent Monologue, Multi-agent Dialogue, and Teacher-student Dialogue) to enable flexible, scalable learning across ALFWorld, HotpotQA, Chameleon, and GSM8k. Empirically, LTC outperforms instruction-tuning baselines by 3.6–12% across these datasets, demonstrating improved adaptability and efficiency, though gains over larger-model baselines on numerical reasoning may depend on model scale. This work highlights a practical, low-supervision approach to online adaptation of LLM agents through structured communication and feedback.
Abstract
Recent advances in large language models (LLMs) have demonstrated potential for LLM agents. To facilitate the training of these agents with both linguistic feedback and non-linguistic reward signals, we introduce Learning through Communication (LTC). We design a universal buffer to store all the feedback, and an iterative pipeline to enable an LLM agent to explore and update its policy in a given environment. To optimize agent interactions for task-specific learning with our universal buffer and pipeline, we introduce diverse communication patterns tailored for both single-agent and multi-agent environments. We evaluate the efficacy of our LTC approach on four diverse datasets: ALFWorld (single-agent), HotpotQA (multi-agent collaboration), Chameleon (multi-agent competition), and GSM8k (multi-agent teacher-student). On these datasets, LTC outperforms the supervised instruction fine-tuning baselines by 3.6% to 12%. These results highlight the versatility and efficiency of LTC in facilitating online adaptation for LLM agents.
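As a rough illustration of the universal buffer described above, the following sketch stores one trajectory as a flat token sequence, a per-token source mask, and a reward. It is not the authors' implementation. The names `Trajectory`, `UniversalBuffer`, `agent_positions`, and the labels `AGENT` and `ENVIRONMENT` are hypothetical, and the single scalar reward per trajectory is an assumption made for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical source labels: who produced each token.
AGENT, ENVIRONMENT = 1, 0


@dataclass
class Trajectory:
    tokens: List[int]       # token ids for the whole interaction
    source_mask: List[int]  # one source label per token
    reward: float           # assumed: one scalar reward per trajectory


@dataclass
class UniversalBuffer:
    trajectories: List[Trajectory] = field(default_factory=list)

    def add(self, segments: List[Tuple[int, List[int]]], reward: float) -> Trajectory:
        """Flatten (source, token_ids) segments into one masked trajectory."""
        tokens: List[int] = []
        mask: List[int] = []
        for source, token_ids in segments:
            tokens.extend(token_ids)
            mask.extend([source] * len(token_ids))
        traj = Trajectory(tokens, mask, reward)
        self.trajectories.append(traj)
        return traj


def agent_positions(traj: Trajectory) -> List[int]:
    """Indices of tokens the agent generated (candidates for policy updates)."""
    return [i for i, src in enumerate(traj.source_mask) if src == AGENT]


# Toy exploration result: an observation, the agent's reply, then feedback.
buffer = UniversalBuffer()
traj = buffer.add(
    [(ENVIRONMENT, [11, 12, 13]), (AGENT, [21, 22]), (ENVIRONMENT, [31])],
    reward=1.0,
)
```

Keeping the source of every token alongside the text is what would let an update step treat agent-generated tokens differently from environment or partner tokens, for example when combining a language-modeling loss with a PPO loss.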
