Trinity-RFT: A General-Purpose and Unified Framework for Reinforcement Fine-Tuning of Large Language Models

Xuchen Pan, Yanxi Chen, Yushuo Chen, Yuchang Sun, Daoyuan Chen, Wenhao Zhang, Yuexiang Xie, Yilun Huang, Yilei Zhang, Dawei Gao, Weijie Shi, Yaliang Li, Bolin Ding, Jingren Zhou

TL;DR

Trinity-RFT is a general-purpose framework for reinforcement fine-tuning of large language models. Its decoupled RFT-core unifies diverse RL modes, and its agent–environment interaction is built for robustness. Data pipelines, experience buffers, and human-in-the-loop capabilities enable flexible curriculum learning, online reward shaping, and scalable experiments. The work demonstrates practical implementations, including MIX-style algorithm integration, and reports performance profiling and learning experiments with GRPO on tasks such as GSM8K and ALFWorld. Together, these contributions offer a scalable, low-friction platform for exploring and deploying advanced reinforcement learning paradigms for LLMs, bridging research and real-world applications.

Abstract

Trinity-RFT is a general-purpose, unified and easy-to-use framework designed for reinforcement fine-tuning (RFT) of large language models. It is built with a modular and decoupled design, consisting of (1) an RFT-core that unifies and generalizes synchronous/asynchronous, on-policy/off-policy, and online/offline modes of RFT; (2) seamless integration for agent-environment interaction with high efficiency and robustness; and (3) systematic data pipelines optimized for RFT. Trinity-RFT can be easily adapted for diverse application scenarios, and serves as a unified platform for development and research of advanced reinforcement learning paradigms at both macroscopic and microscopic levels. This technical report outlines the vision, features, design and implementations of Trinity-RFT, accompanied by extensive examples, applications and experiments that demonstrate its functionalities and user-friendliness.

Paper Structure

This paper contains 55 sections, 16 figures, 3 tables.

Figures (16)

  • Figure 1: The high-level design of Trinity-RFT.
  • Figure 2: The high-level design of data pipelines in Trinity-RFT.
  • Figure 3: The architecture of RFT-core in Trinity-RFT.
  • Figure 4: A visualization of diverse RFT modes supported by Trinity-RFT, including: (a) synchronous mode, with sync_interval=2; (b) one-step off-policy mode, with sync_interval=1 and sync_offset=1; (c) fully asynchronous mode, with sync_interval=2; (d) multi-explorer asynchronous mode, with sync_interval=2. The buffer supports, in principle, arbitrary management and sampling strategies for experiences.
  • Figure 5: The interaction of data processor and data buffers in Trinity-RFT, divided into two key stages. Left: Task Curation & Prioritization prepares the initial tasks for the explorer. Right: Experience Shaping processes the collected trajectories from the explorer before they are used by the trainer. The data processor is a central component that operates on different buffers at different stages.
  • ...and 11 more figures
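
The synchronous and one-step off-policy modes named in the Figure 4 caption can be illustrated with a toy schedule. The sketch below is only an assumption about how `sync_interval` and `sync_offset` might interact, not Trinity-RFT's implementation: it supposes that weights are synchronized every `sync_interval` trainer steps and that the explorer may run `sync_offset` batches ahead of the trainer. The function names `weight_version` and `staleness` are hypothetical, and the asynchronous modes (c) and (d) are not modeled.

```python
def weight_version(batch_idx: int, sync_interval: int, sync_offset: int = 0) -> int:
    """Trainer steps contained in the weights used to generate a batch.

    Toy model (an assumption, not the framework's code): weights sync every
    `sync_interval` trainer steps, and the explorer runs `sync_offset`
    batches ahead of the trainer.
    """
    if batch_idx < 0 or sync_interval < 1 or sync_offset < 0:
        raise ValueError("batch_idx >= 0, sync_interval >= 1, sync_offset >= 0 required")
    lagged = batch_idx - sync_offset
    if lagged < 0:
        # The explorer is still using the initial weights.
        return 0
    # Round down to the most recent synchronization point.
    return (lagged // sync_interval) * sync_interval


def staleness(batch_idx: int, sync_interval: int, sync_offset: int = 0) -> int:
    """How many trainer steps old the generating weights are when the batch is consumed."""
    return batch_idx - weight_version(batch_idx, sync_interval, sync_offset)


if __name__ == "__main__":
    # (a) synchronous mode, sync_interval=2
    print([staleness(b, sync_interval=2) for b in range(6)])
    # (b) one-step off-policy mode, sync_interval=1, sync_offset=1
    print([staleness(b, sync_interval=1, sync_offset=1) for b in range(6)])
```

Under this toy model, staleness in the synchronous case never reaches `sync_interval`, while in the one-step off-policy case every batch after the first is exactly one trainer step stale.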