Training Agents with Weakly Supervised Feedback from Large Language Models

Dihong Gong, Pu Lu, Zelong Wang, Meng Zhou, Xiuqiang He

TL;DR

This work tackles scalable training of LLM-based agents without expert demonstrations or definitive environmental rewards by introducing a critic LLM that provides weak supervision by scoring agent trajectories. The approach uses an iterative loop of trajectory sampling, critic evaluation, and supervised fine-tuning, formalized with a trajectory set $T$ covering $N$ instructions with $K$ sampled trajectories per instruction, and optimized via a negative log-likelihood objective. Experiments on API-Bank show that open-source models such as Yi-6B and Llama2-13B can approach the performance of GPT-4, highlighting the practical potential of weak supervision for multi-domain API usage. Overall, the method reduces reliance on labeled data and demonstrates scalable, incremental learning for task-oriented agents that use APIs across multiple domains.
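
As a hedged sketch of what such a formalization could look like: the symbols $\tau_{i,k}$ (the $k$-th trajectory sampled for instruction $x_i$), $c(\cdot)$ (the critic's accept/reject verdict), $T^{+}$ (the accepted subset), $a_t$ and $h_{<t}$ (the agent's output at step $t$ and the interaction history before it), and $\pi_\theta$ (the agent policy) are notation assumed here for illustration and may differ from the paper's own two equations.

$$T = \{\tau_{i,k} : 1 \le i \le N,\ 1 \le k \le K\}, \qquad T^{+} = \{\tau \in T : c(\tau) = 1\}$$

$$\mathcal{L}(\theta) = -\sum_{\tau \in T^{+}} \sum_{t=1}^{|\tau|} \log \pi_\theta\left(a_t \mid x, h_{<t}\right)$$

Under these assumptions, minimizing $\mathcal{L}(\theta)$ is ordinary supervised fine-tuning restricted to the trajectories the critic accepted, where $x$ is the instruction that produced $\tau$.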

Abstract

Large Language Models (LLMs) offer a promising basis for creating agents that can tackle complex tasks through iterative environmental interaction. Existing methods either require these agents to mimic expert-provided trajectories or rely on definitive environmental feedback for reinforcement learning, which limits their application to specific scenarios such as gaming or code generation. This paper introduces a novel training method for LLM-based agents using weakly supervised signals from a critic LLM, bypassing the need for expert trajectories or definitive feedback. Our agents are trained in an iterative manner: they first generate trajectories through environmental interaction. A critic LLM then selects a subset of good trajectories, which are used to update the agents, enabling them to generate improved trajectories in the next iteration. Extensive tests on the API-Bank dataset show consistent improvement in our agents' capabilities and performance comparable to GPT-4, despite using open-source models with far fewer parameters.

Paper Structure

This paper contains 15 sections, 2 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Our self-evolving algorithm employs a comprehensive training pipeline to instruct LLMs in the utilization of APIs. The process begins with a set of instructions, which guide the actor module in interacting with an environment composed of various APIs, thereby generating a sequence of trials. Subsequently, the critic module is applied to discern a subset of trials where it perceives the actor has successfully executed the instruction. These successful trials are then forwarded to the trainer module, which updates the underlying actor module. To prevent overfitting, this update is supplemented with general chat data. This procedure is iteratively repeated, allowing the actor module to evolve and adapt to its environment.
  • Figure 2: Accuracy results for different numbers of training iterations.
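
The pipeline described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal toy illustration, not the authors' implementation: `Trial`, `self_evolve`, `ToyActor`, and `toy_critic` are hypothetical names, the actor is a random stand-in for an LLM, and the critic is a simple rule where the paper uses a critic LLM.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trial:
    instruction: str
    steps: List[str]  # hypothetical record of the actor's API interaction


def self_evolve(
    instructions: List[str],
    sample: Callable[[str], Trial],
    critic: Callable[[Trial], bool],
    train: Callable[[List[Trial], List[str]], None],
    chat_data: List[str],
    k: int,
    iterations: int,
) -> List[int]:
    """Run the sample -> critique -> fine-tune loop; return accepted counts."""
    accepted_per_iteration = []
    for _ in range(iterations):
        # Actor: K trials per instruction, gathered by interacting with the APIs.
        trials = [sample(ins) for ins in instructions for _ in range(k)]
        # Critic: keep only trials it judges to have fulfilled the instruction.
        accepted = [t for t in trials if critic(t)]
        # Trainer: update the actor on accepted trials mixed with chat data.
        train(accepted, chat_data)
        accepted_per_iteration.append(len(accepted))
    return accepted_per_iteration


class ToyActor:
    """Stand-in for an LLM actor (assumption): succeeds with probability `skill`."""

    def __init__(self, skill: float, seed: int = 0) -> None:
        self.skill = skill
        self.updates = 0
        self._rng = random.Random(seed)

    def sample(self, instruction: str) -> Trial:
        ok = self._rng.random() < self.skill
        return Trial(instruction, ["call_api", "success" if ok else "error"])

    def train(self, trials: List[Trial], chat_data: List[str]) -> None:
        # Toy update: any accepted trials nudge the skill upward, capped at 1.0.
        self.updates += 1
        if trials:
            self.skill = min(1.0, self.skill + 0.1)


def toy_critic(trial: Trial) -> bool:
    # A real critic would be an LLM judgment; here we only inspect the last step.
    return trial.steps[-1] == "success"


actor = ToyActor(skill=0.3)
counts = self_evolve(
    ["book a flight", "check the weather"],
    actor.sample,
    toy_critic,
    actor.train,
    chat_data=["hello"],
    k=4,
    iterations=3,
)
print(counts)  # number of critic-accepted trials in each iteration
```

Passing the actor, critic, and trainer as plain callables keeps the loop itself independent of any particular model or fine-tuning library, so each module can be swapped out without touching the iteration logic.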