Agent Lightning: Train ANY AI Agents with Reinforcement Learning

Xufang Luo, Yuge Zhang, Zhiyuan He, Zilong Wang, Siyun Zhao, Dongsheng Li, Luna K. Qiu, Yuqing Yang

TL;DR

Agent Lightning is a decoupled, agent-agnostic RL framework for training LLM-based agents. It recasts agent execution as a Markov decision process (MDP) and unifies the collected data into trajectories of transitions. It introduces LightningRL, a hierarchical RL approach that lets existing single-turn RL methods be applied to multi-step agents, and a Training-Agent Disaggregation architecture that separates training from the agent runtime. The framework supports data capture, Automatic Intermediate Rewarding (AIR), and scalable rollouts through a two-component server-client design. It is evaluated on text-to-SQL, retrieval-augmented generation, and math tool-use tasks, where it shows consistent improvements. The work offers a general, scalable path to optimizing real-world agents, enabling continuous learning and deployment-ready agent capabilities with minimal code changes.

Abstract

We present Agent Lightning, a flexible and extensible framework that enables Reinforcement Learning (RL)-based training of Large Language Models (LLMs) for any AI agent. Unlike existing methods that tightly couple RL training with the agent or rely on sequence concatenation with masking, Agent Lightning achieves complete decoupling between agent execution and training. This allows seamless integration with existing agents developed in diverse ways (e.g., with frameworks such as LangChain, OpenAI Agents SDK, and AutoGen, or built from scratch) with almost ZERO code modifications. By formulating agent execution as a Markov decision process, we define a unified data interface and propose a hierarchical RL algorithm, LightningRL, which contains a credit assignment module, allowing us to decompose trajectories generated by ANY agent into training transitions. This enables RL to handle complex interaction logic, such as multi-agent scenarios and dynamic workflows. For the system design, we introduce a Training-Agent Disaggregation architecture and bring agent observability frameworks into the agent runtime, providing a standardized agent finetuning interface. Experiments across text-to-SQL, retrieval-augmented generation, and math tool-use tasks demonstrate stable, continuous improvements, showcasing the framework's potential for real-world agent training and deployment.
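
The decomposition of a trajectory into training transitions can be illustrated with a minimal sketch. The names `Transition` and `decompose` are hypothetical and are not part of the Agent Lightning API. The credit assignment rule used here (copying the episode's final reward to every LLM call) is an assumption chosen for simplicity, not necessarily the rule the authors use.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Transition:
    """One LLM call viewed as an RL transition (hypothetical structure)."""
    prompt: str      # input/context the LLM saw at this call
    response: str    # text the LLM generated
    reward: float    # reward attributed to this call by credit assignment


def decompose(calls: List[Tuple[str, str]], final_reward: float) -> List[Transition]:
    """Turn the (prompt, response) pairs of one episode into transitions.

    Assumption: the episode-level reward is copied to every call.
    """
    return [Transition(prompt, response, final_reward) for prompt, response in calls]


# Illustrative episode with two LLM calls; the content is made up.
episode = [
    ("Question: total sales in 2023?", "SELECT SUM(amount) FROM sales WHERE year = 2023"),
    ("Check the query above against the schema.", "The query is valid."),
]
transitions = decompose(episode, final_reward=1.0)
```

Because each transition carries its own input, output, and reward, it can be handed to a single-turn RL method without concatenating the whole episode or masking tokens the LLM did not generate.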

Paper Structure

This paper contains 44 sections, 8 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Overview of Agent Lightning, a flexible and extensible framework that enables reinforcement learning of LLMs for ANY AI agents.
  • Figure 2: Illustration of the unified data interface in Agent Lightning. The left panel depicts the agent execution flow, where each state transition is represented by the update of semantic variables (green rectangles denote variables with valid values; gray rectangles indicate variables not yet assigned in the current state). The right panel presents the corresponding trajectory collected throughout the agent's execution, demonstrating how the unified data interface systematically captures all relevant transitions for RL-based optimization.
  • Figure 3: Illustration of the LightningRL algorithm. (a) Single-call GRPO, where the LLM generates a response to a task in one pass. Outputs for the same task are grouped together for advantage estimation. (b) Previous multi-turn GRPO. Each trajectory contains multiple LLM calls, with trajectories for the same task grouped for advantage estimation. Tokens not generated by the LLM are masked (gray dashed boxes) during optimization. (c) Our proposed LightningRL. Trajectories are decomposed into transitions, and transitions for the same task are grouped for advantage estimation. Each transition includes the current input/context, output, and reward. The input is part of the current agent state, with rewards computed by the credit assignment module.
  • Figure 4: Training-Agent Disaggregation architecture.
  • Figure 5: Reward curves for the Text-to-SQL task.
  • ...and 3 more figures
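
The grouping described in the Figure 3 caption, where transitions for the same task are grouped for advantage estimation, can be sketched as follows. This is a minimal illustration in the style of GRPO's group-relative normalization. The function name `group_advantages`, the `(task_id, reward)` input shape, and the exact normalization (population standard deviation plus a small `eps`) are assumptions, not the paper's implementation.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def group_advantages(samples: List[Tuple[str, float]], eps: float = 1e-6) -> List[float]:
    """Compute a group-normalized advantage for each transition.

    samples: one (task_id, reward) pair per transition.
    Returns advantages in the same order as the input.
    """
    # Collect the rewards of all transitions that belong to the same task.
    by_task: Dict[str, List[float]] = defaultdict(list)
    for task_id, reward in samples:
        by_task[task_id].append(reward)

    # Per-task mean and population standard deviation.
    stats = {task: (mean(rs), pstdev(rs)) for task, rs in by_task.items()}

    # Normalize each reward against its own task group.
    return [(r - stats[t][0]) / (stats[t][1] + eps) for t, r in samples]


# Made-up rewards: four transitions for task "q1", two for task "q2".
samples = [("q1", 1.0), ("q1", 1.0), ("q1", 0.0), ("q1", 0.0), ("q2", 0.5), ("q2", 0.5)]
advantages = group_advantages(samples)
```

In this example the successful "q1" transitions receive an advantage close to +1 and the failed ones close to -1, while the "q2" transitions receive 0 because every reward in that group is identical.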