Better than Your Teacher: LLM Agents that learn from Privileged AI Feedback

Sanjiban Choudhury, Paloma Sodhi

TL;DR

This work proposes LEAP, an iterative fine-tuning framework that continually improves LLM agents using feedback from AI expert teachers. The teachers are equipped with a privileged state -- information that is available during training but hidden at test time.

Abstract

While large language models (LLMs) show impressive decision-making abilities, current methods lack a mechanism for automatic self-improvement from errors during task execution. We propose LEAP, an iterative fine-tuning framework that continually improves LLM agents using feedback from AI expert teachers. Our key insight is to equip the expert teachers with a privileged state -- information that is available during training but hidden at test time. This allows even weak experts to provide precise guidance, significantly improving the student agent's performance without access to privileged information at test time. We evaluate LEAP on diverse decision-making benchmarks, including text-based games (ALFWorld), web navigation (WebShop), and interactive coding (Intercode Bash). Our experiments show that LEAP (1) outperforms behavior cloning and ReAct baselines, (2) enables weak student models (e.g., Llama3-8B) to exceed the performance of strong teacher models (GPT4-o), and (3) allows weak models to self-improve using privileged versions of themselves. We also provide a theoretical analysis showing that LEAP's success hinges on balancing privileged information with the student's realizability, which we empirically validate. Our code is available at https://leap-llm.github.io

Paper Structure

This paper contains 42 sections, 3 theorems, 19 equations, 6 figures, 5 tables, 1 algorithm.

Key Result

Theorem 3.2

Running $N$ iterations of LEAP with privileged expert $\pi^E$ yields at least one policy $\pi$ whose time-average performance $\frac{1}{T} J(\pi)$ is bounded in terms of three quantities: $H(\pi^E)$, the recoverability coefficient of $\pi^E$; $\epsilon(\pi^E, T)$, the average realizability gap; and $\gamma(N)$, the average regret of the DAgger update.
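
A bound built from these quantities would plausibly take the following shape, assuming a DAgger-style reduction in which the student's shortfall relative to the expert scales with the recoverability coefficient times the sum of the realizability gap and the regret. This is a sketch of the expected form under that assumption, not a verbatim restatement of the theorem:

```latex
\frac{1}{T} J(\pi) \;\geq\; \frac{1}{T} J(\pi^E) \;-\; H(\pi^E)\,\bigl(\epsilon(\pi^E, T) + \gamma(N)\bigr)
```

Read this way, the student approaches the privileged expert's performance when the realizability gap $\epsilon(\pi^E, T)$ is small and the regret $\gamma(N)$ shrinks with more iterations. This matches the abstract's claim that success hinges on balancing privileged information against realizability.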

Figures (6)

  • Figure 1: LEAP overview. The LLM student agent interacts with the environment, generating a reason-action trajectory (in orange) based on its policy $\pi_{i-1}$. An expert teacher, with a privileged state available only during training, evaluates and corrects the trajectory (in green). These corrections update the learner's policy to $\pi_{i}$ through SFT/DPO training. The updated policy $\pi_{i}$ is then rolled out at test time without access to the privileged state.
  • Figure 2: ALFWorld Training and Testing for LEAP. (a) Training: Student policy $\pi_0$ is rolled out on a training task to generate reasons and actions, e.g. it fails to find the cellphone because it inefficiently searches all drawers. Expert teacher $\pi^E$ uses privileged information (cellphone in desk1) to generate general corrected reason-actions that don't reveal privileged information (cellphones are commonly found in desks, so it is more efficient to search desks than drawers). (b) Testing: $\pi_0$ fails to find a watch as it inefficiently explores shelves one by one. $\pi_1$ learns a more efficient exploration policy, prioritizing areas like sidetables and dressers, solving the task quickly.
  • Figure 3: WebShop Evaluation. (a) Overall score $\uparrow$ and #act $\downarrow$ on $500$ test tasks (max $30$ actions). (b) Performance of LEAP over iterations on $4$ different score components. Baseline comparisons include IL (Yao et al., 2022) and our ReAct instruction prompt with different models. LEAP with an 8B model across iterations ($\pi_2, \pi_3$) outperforms the stronger teacher ReAct gpt-4o.
  • Figure 4: WebShop Training and Testing for LEAP. (a) Training: Teacher policy generates corrections on the student rollout to backtrack when the product does not fit the criteria or to commit to a product when it does. (b) Evaluation: $\pi_0$ fails to solve the task since it continues to search page after page even after discovering a good product. $\pi_1$ learns when to backtrack and when to commit to a product to solve the task in time.
  • Figure 5: Privileged Information vs Realizability. (a) Performance of $5$ policies trained with experts with varying levels of privileged information on ALFWorld, peaking for expert $\pi^E_3$. (b) Examples of corrections from experts $\pi^E_1, \pi^E_3, \pi^E_5$. $\pi^E_1$ generates a realizable reason-action but predicts the wrong action. $\pi^E_5$ predicts the correct action, but produces an unrealizable reason-action that contains privileged information. $\pi^E_3$ strikes a perfect balance.
  • ...and 1 more figure
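
The loop described in the Figure 1 caption (roll out the student, let a privileged teacher correct it, fine-tune, repeat) can be sketched with a toy example. Everything below is a hypothetical illustration rather than the authors' implementation: `TabularStudent` replaces the LLM with a lookup table, `finetune` stands in for SFT/DPO, tasks are single-step, and aggregating corrections across iterations is a DAgger-style assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# All names in this sketch are hypothetical and chosen for illustration.
Observation = str
Action = str
PrivilegedState = str
Correction = Tuple[Observation, Action]
Teacher = Callable[[Observation, Action, PrivilegedState], Action]


@dataclass
class TabularStudent:
    """Toy stand-in for the LLM student: maps an observation to an action."""
    table: Dict[Observation, Action] = field(default_factory=dict)
    default_action: Action = "search drawers"

    def act(self, obs: Observation) -> Action:
        # The student only ever sees the observation, never the privileged state.
        return self.table.get(obs, self.default_action)

    def finetune(self, data: List[Correction]) -> "TabularStudent":
        # Stand-in for SFT/DPO: adopt the corrected action for each observation.
        new_table = dict(self.table)
        for obs, action in data:
            new_table[obs] = action
        return TabularStudent(new_table, self.default_action)


def privileged_teacher(obs: Observation, student_action: Action,
                       hidden_location: PrivilegedState) -> Action:
    # The teacher knows where the object is (privileged state) and returns
    # the action the student should have taken for this observation.
    return f"search {hidden_location}"


def leap(student: TabularStudent,
         tasks: List[Tuple[Observation, PrivilegedState]],
         teacher: Teacher,
         num_iterations: int) -> TabularStudent:
    dataset: List[Correction] = []  # aggregated across iterations (assumption)
    for _ in range(num_iterations):
        for obs, privileged_state in tasks:
            action = student.act(obs)  # roll out the current policy
            corrected = teacher(obs, action, privileged_state)
            if corrected != action:
                dataset.append((obs, corrected))
        student = student.finetune(dataset)  # policy pi_{i-1} -> pi_i
    return student


# Each training task pairs an observation with a state hidden at test time.
TRAIN_TASKS = [("find cellphone", "desk"), ("find watch", "dresser")]

if __name__ == "__main__":
    trained = leap(TabularStudent(), TRAIN_TASKS, privileged_teacher, num_iterations=2)
    print(trained.act("find cellphone"))
```

In this sketch the privileged state enters only through the teacher's corrections. The trained student is queried with the observation alone, mirroring the test-time setting in which privileged information is unavailable.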

Theorems & Definitions (10)

  • Definition 3.1: Average Realizability Gap
  • Theorem 3.2: LEAP with Privileged Expert
  • Definition B.1: Average Imitation Gap
  • Definition B.2: Average Realizability Gap
  • Definition B.3: Recoverability Coefficient of the Privileged Expert
  • Theorem B.4: LEAP with Privileged Expert
  • proof
  • Definition B.5: Constrained Privileged Expert
  • Theorem B.6: LEAP with Constrained Privileged Expert
  • proof