Training Agents with Weakly Supervised Feedback from Large Language Models
Dihong Gong, Pu Lu, Zelong Wang, Meng Zhou, Xiuqiang He
TL;DR
This work tackles scalable training of LLM-based agents without expert demonstrations or definitive environmental rewards by introducing a weakly supervised critic LLM that scores agent trajectories. The approach uses an iterative loop of trajectory sampling, critic evaluation, and supervised fine-tuning, formalized with a trajectory set $T$ across $N$ instructions and $K$ samples per instruction, and optimized via a negative log-likelihood objective. Experiments on API-Bank show that open-source models such as Yi-6B and Llama2-13B can achieve performance approaching GPT-4, highlighting the practical potential of weak supervision for multi-domain API usage. Overall, the method reduces reliance on labeled data and demonstrates scalable, incremental learning for task-oriented agents with strong cross-domain capabilities.
Abstract
Large Language Models (LLMs) offer a promising basis for creating agents that can tackle complex tasks through iterative environmental interaction. Existing methods either require these agents to mimic expert-provided trajectories or rely on definitive environmental feedback for reinforcement learning, which limits their application to specific scenarios such as gaming or code generation. This paper introduces a novel training method for LLM-based agents that uses weakly supervised signals from a critic LLM, bypassing the need for expert trajectories or definitive feedback. Our agents are trained in an iterative manner: they first generate trajectories through environmental interaction. A critic LLM then selects a subset of good trajectories, which are used to update the agents, enabling them to generate improved trajectories in the next iteration. Extensive tests on the API-Bank dataset show consistent improvement in our agents' capabilities and performance comparable to GPT-4, despite using open-source models with far fewer parameters.
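The iterative procedure summarized above (sample trajectories, let a critic select the good ones, fine-tune on the selection, repeat) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Trajectory`, `sample_trajectories`, `select_good`, and `train_loop`, the score `threshold`, and the toy agent and critic are all hypothetical stand-ins. In practice the agent and critic would be LLMs, and `fine_tune` would run supervised fine-tuning on the selected trajectories.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    instruction: str
    actions: List[str]
    score: float = 0.0  # filled in by the critic


def sample_trajectories(agent: Callable, instructions: List[str], k: int,
                        rng: random.Random) -> List[Trajectory]:
    """Sample k trajectories per instruction (N * k trajectories in total)."""
    return [Trajectory(ins, agent(ins, rng))
            for ins in instructions for _ in range(k)]


def select_good(trajectories: List[Trajectory], critic: Callable,
                threshold: float) -> List[Trajectory]:
    """Score each trajectory with the critic and keep those at or above threshold."""
    kept = []
    for traj in trajectories:
        traj.score = critic(traj)
        if traj.score >= threshold:
            kept.append(traj)
    return kept


def train_loop(agent: Callable, critic: Callable, fine_tune: Callable,
               instructions: List[str], k: int, iterations: int,
               threshold: float, seed: int = 0) -> Callable:
    """Alternate sampling, critic selection, and fine-tuning."""
    rng = random.Random(seed)
    for _ in range(iterations):
        trajectories = sample_trajectories(agent, instructions, k, rng)
        good = select_good(trajectories, critic, threshold)
        if good:  # skip the update when the critic accepts nothing
            agent = fine_tune(agent, good)
    return agent


# Toy stand-ins (assumptions for illustration only).
def toy_agent(instruction: str, rng: random.Random) -> List[str]:
    return [rng.choice(["call_api", "noop"])]


def toy_critic(traj: Trajectory) -> float:
    return 1.0 if traj.actions == ["call_api"] else 0.0
```

Because the critic's scores are only weak supervision, the threshold controls a trade-off: a stricter threshold yields fewer but cleaner training trajectories per iteration.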
