
RLHFless: Serverless Computing for Efficient RLHF

Rui Wei, Hanfei Yu, Shubham Jain, Yogarajan Sivakumar, Devesh Tiwari, Jian Li, Seung-Jong Park, Hao Wang

TL;DR

RLHFless is the first scalable training framework for synchronous RLHF built on serverless computing environments; it achieves up to 1.35x speedup and 44.8% cost reduction compared to the state-of-the-art baseline.

Abstract

Reinforcement Learning from Human Feedback (RLHF) has been widely applied to Large Language Model (LLM) post-training to align model outputs with human preferences. Recent models, such as DeepSeek-R1, have also shown RLHF's potential to improve LLM reasoning on complex tasks. In RL, inference and training co-exist, creating dynamic resource demands throughout the workflow. Compared to traditional RL, RLHF further challenges training efficiency due to expanding model sizes and resource consumption. Several RLHF frameworks aim to balance flexible abstraction and efficient execution. However, they rely on serverful infrastructures, which struggle with fine-grained resource variability. As a result, during synchronous RLHF training, idle time between or within RL components often causes overhead and resource wastage. To address these issues, we present RLHFless, the first scalable training framework for synchronous RLHF, built on serverless computing environments. RLHFless adapts to dynamic resource demands throughout the RLHF pipeline, pre-computes shared prefixes to avoid repeated computation, and uses a cost-aware actor scaling strategy that accounts for response length variation to find sweet spots with lower cost and higher speed. In addition, RLHFless assigns workloads efficiently to reduce intra-function imbalance and idle time. Experiments on both physical testbeds and a large-scale simulated cluster show that RLHFless achieves up to 1.35x speedup and 44.8% cost reduction compared to the state-of-the-art baseline.
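
The abstract notes that response lengths vary and that uneven workloads leave some functions idle. As a rough illustration of that imbalance problem only, the sketch below places prompts on workers with a generic greedy longest-first heuristic. It is not the assignment policy of RLHFless, which the abstract does not specify; the function name `assign_by_length` and the per-prompt length estimates are hypothetical.

```python
import heapq


def assign_by_length(predicted_lengths, num_workers):
    """Greedy longest-first assignment of prompts to workers.

    predicted_lengths: hypothetical per-prompt response-length estimates (tokens).
    Returns (assignments, loads), where assignments[w] lists the prompt
    indices given to worker w and loads[w] is that worker's total tokens.
    """
    # Min-heap of (current load, worker id), so the least-loaded worker pops first.
    heap = [(0, w) for w in range(num_workers)]
    heapq.heapify(heap)
    assignments = [[] for _ in range(num_workers)]

    # Visit prompts from longest to shortest estimated response.
    order = sorted(range(len(predicted_lengths)),
                   key=lambda i: -predicted_lengths[i])
    for i in order:
        load, w = heapq.heappop(heap)
        assignments[w].append(i)
        heapq.heappush(heap, (load + predicted_lengths[i], w))

    loads = [0] * num_workers
    for load, w in heap:
        loads[w] = load
    return assignments, loads


# Illustrative (made-up) length estimates for six prompts on two workers.
lengths = [900, 100, 850, 150, 500, 500]
assignments, loads = assign_by_length(lengths, num_workers=2)
```

With these made-up lengths, dealing prompts out in round-robin order would give one worker 2250 tokens and the other 750, so the second would sit idle while the first finishes. The greedy placement gives each worker 1500 tokens.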

Paper Structure (23 sections, 7 equations, 17 figures)

This paper contains 23 sections, 7 equations, 17 figures.

Figures (17)

  • Figure 1: RLHF's dataflow in one training step, including the generation phase, preparation phase, and learning phase. Some details, like the use of the critic model, can vary depending on the specific algorithm used.
  • Figure 2: (a) RLHF's staged workflow causes idle time between components. (b) Idle components and (c) repeated calculation in RLHF lead to resource wastage.
  • Figure 3: Existing generation strategies [Sheng2025hybridflow, hu2024openrlhfeasytousescalablehighperformance, zheng2024sglangefficientexecutionstructured, kwon2023vllm] for sampling-heavy RLHF workloads, including (a) iterative generation, which introduces additional latency, and (b) parallel generation, which leads to (c) unnecessary KV recalculation.
  • Figure 4: Dynamic resource demands caused by: (a) significantly varying response lengths across and within different types of datasets, such as mathematical problems [aime2024], general scientific questions [rein2024gpqa], and coding tasks [jain2024livecodebench]; (b) changing response lengths during RLHF training.
  • Figure 5: RLHFless's overview.
  • ...and 12 more figures