QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs

Wei Huang, Yi Ge, Shuai Yang, Yicheng Xiao, Huizi Mao, Yujun Lin, Hanrong Ye, Sifei Liu, Ka Chun Cheung, Hongxu Yin, Yao Lu, Xiaojuan Qi, Song Han, Yukang Chen

TL;DR

QeRL introduces a quantization-enhanced RL framework for LLMs by uniting NVFP4 weight quantization with LoRA and an adaptive quantization noise mechanism. The approach leverages a Marlin-based kernel to accelerate rollout/prefill, and noise-sharing to inject exploration without extra parameters, yielding significant rollout and end-to-end speedups. Empirical results show QeRL matching or surpassing 16-bit LoRA and approaching full fine-tuning on math benchmarks across 3B–32B models, while enabling 32B RL training on a single H100-80GB GPU. This delivers a practical, memory-efficient pathway for RL training in large models with strong reasoning capabilities and improved exploration dynamics.

Abstract

We propose QeRL, a Quantization-enhanced Reinforcement Learning framework for large language models (LLMs). While RL is essential for LLMs' reasoning capabilities, it is resource-intensive, requiring substantial GPU memory and long rollout durations. QeRL addresses these issues by combining NVFP4 quantization with Low-Rank Adaptation (LoRA), accelerating the rollout phase of RL while reducing memory overhead. Beyond efficiency, our findings show that quantization noise increases policy entropy, which enhances exploration and enables the discovery of better strategies during RL. To further optimize exploration, QeRL introduces an Adaptive Quantization Noise (AQN) mechanism, which dynamically adjusts noise during training. Experiments demonstrate that QeRL delivers a speedup of over 1.5 times in the rollout phase. Moreover, this is the first framework to enable RL training of a 32B LLM on a single H100 80GB GPU, while delivering overall speedups for RL training. It also achieves faster reward growth and higher final accuracy than 16-bit LoRA and QLoRA, while matching the performance of full-parameter fine-tuning on mathematical benchmarks such as GSM8K (90.8%) and MATH 500 (77.4%) in the 7B model. These results establish QeRL as an efficient and effective framework for RL training in LLMs.
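
As a rough illustration of why 4-bit weights make a 32B model fit on one 80GB GPU, the sketch below estimates weight storage only. It assumes the NVFP4 layout of 4-bit values with one 8-bit scale per block of 16 weights (about 4.5 bits per weight), and it ignores LoRA adapters, activations, the KV cache, optimizer state, and any layers kept in higher precision. The function name and the round 32e9 parameter count are illustrative assumptions, not figures from the paper.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


NUM_PARAMS = 32e9  # assumed round parameter count for a "32B" model

# 16-bit baseline (BF16/FP16): 16 bits per weight.
bf16_gib = weight_memory_gib(NUM_PARAMS, 16)

# Assumed NVFP4 layout: 4-bit value + one 8-bit scale shared by 16 weights.
nvfp4_bits = 4 + 8 / 16
nvfp4_gib = weight_memory_gib(NUM_PARAMS, nvfp4_bits)

print(f"16-bit weights: {bf16_gib:.1f} GiB")
print(f"NVFP4 weights:  {nvfp4_gib:.1f} GiB")
print(f"reduction:      {bf16_gib / nvfp4_gib:.2f}x")
```

Under these assumptions the 16-bit weights alone take most of an 80GB card, while the 4-bit weights leave room for rollout and training state.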

Paper Structure

This paper contains 32 sections, 12 equations, 17 figures, 11 tables, 1 algorithm.

Figures (17)

  • Figure 1: Rollout speedup and accuracy of QeRL on Qwen2.5-7B-Instruct. QeRL achieves faster RL rollout and end-to-end training speeds (batch=8), while delivering performance superior to vanilla LoRA and QLoRA and comparable to full-parameter RL on mathematical benchmarks.
  • Figure 2: Illustration of QeRL. (a) RL via LoRA: reduces trainable parameters but does not alleviate the rollout bottleneck. (b) RL via QLoRA: NF4 quantization with LoRA, but NF4 is slower than LoRA. (c) QeRL: NVFP4 quantization with LoRA, reducing memory and enabling faster RL while matching full-parameter fine-tuning performance with adaptive quantization noise. AQN dynamically adjusts quantization noise with an exponential scheduler, enhancing exploration.
  • Figure 3: Advancement of Quantization in RL Exploration. Quantization noise yields higher initial entropy, which encourages exploration in RL training and accelerates reward growth.
  • Figure 4: Training reward performance. The upper figures show training rewards under DAPO, while the lower one shows GRPO. Although MXFP4 achieves higher scores in the early stages of training, NVFP4 ultimately converges to better final rewards. LoRA rank is set to 32.
  • Figure 5: Comparison of RL entropy.
  • ...and 12 more figures
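
The Figure 2 caption says AQN adjusts quantization noise with an exponential scheduler. A minimal sketch of such a schedule is given below, assuming a geometric decay from a starting noise scale to an ending one over a fixed number of stages. The function name, the parameters `sigma_start`, `sigma_end`, and `num_stages`, and the example values are hypothetical; the paper's exact formula and hyperparameters are not shown in this section.

```python
def exponential_noise_schedule(sigma_start: float, sigma_end: float,
                               num_stages: int) -> list[float]:
    """Noise scales that decay geometrically from sigma_start to sigma_end.

    Illustrative only: stage k (0-based) uses
    sigma_start * (sigma_end / sigma_start) ** (k / (num_stages - 1)).
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    if sigma_start <= 0 or sigma_end <= 0:
        raise ValueError("noise scales must be positive")
    if num_stages == 1:
        return [sigma_start]
    ratio = sigma_end / sigma_start
    return [sigma_start * ratio ** (k / (num_stages - 1))
            for k in range(num_stages)]


# Hypothetical values: start with larger noise for exploration, then decay.
schedule = exponential_noise_schedule(sigma_start=1e-2, sigma_end=1e-4,
                                      num_stages=5)
for stage, sigma in enumerate(schedule):
    print(f"stage {stage}: sigma = {sigma:.2e}")
```

A geometric decay keeps the ratio between consecutive stages constant, so the noise drops quickly in absolute terms early on and more gently later, which fits the idea of exploring more at the start of training.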