The Impact of Quantization on Large Reasoning Model Reinforcement Learning

Medha Kumar, Zifei Xu, Xin Wang, Tristan Webb

TL;DR

This work investigates how different quantization strategies affect reinforcement-learning-based reasoning in large reasoning models. By comparing quantization-aware RL training (QAFT) against post-training quantization (PTQ) and QLoRA on math-focused benchmarks using the Qwen3 family, the authors reveal a notable performance gap: quantization-aware RL training can hinder learning, while PTQ and QLoRA maintain or improve reasoning performance at inference. Across model scales, PTQ and 4-bit QLoRA generally achieve favorable memory-performance trade-offs, with PTQ methods performing well even at 4-bit precision. The findings suggest avoiding abrupt quantization during RL training, and point to downstream quantization (PTQ, QLoRA) as more robust for preserving mathematical reasoning in LRMs, with implications for deployment efficiency and future quantization techniques.

Abstract

Strong reasoning capabilities can now be achieved by large-scale reinforcement learning (RL) without any supervised fine-tuning. Although post-training quantization (PTQ) and quantization-aware training (QAT) are well studied in the context of fine-tuning, how quantization impacts RL in large reasoning models (LRMs) remains an open question. To answer this question, we conducted systematic experiments and discovered a significant gap in reasoning performance on mathematical benchmarks between post-RL quantized models and their quantization-aware RL optimized counterparts. Our findings suggest that quantization-aware RL training negatively impacted the learning process, whereas PTQ and QLoRA led to greater performance.

Paper Structure

This paper contains 10 sections, 2 figures, and 2 tables.

Figures (2)

  • Figure 1: Mean training reward during RL training of Qwen3-8B, shown as a windowed moving average (window size $=25$).
  • Figure 2: Evaluation reward vs. model size across all evaluated models. The Pareto frontier is shown as a dashed line.
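
The two figure captions describe a smoothed reward curve and a Pareto frontier. The sketch below shows one way such quantities could be computed; it is not the authors' plotting code. It assumes a trailing window (the caption gives only the window size, not its alignment) and assumes that a model lies on the frontier when no model of equal or smaller size reaches an equal or higher reward. The function names and the sample data in the usage note are hypothetical.

```python
def moving_average(values, window=25):
    """Trailing moving average (assumed alignment).

    The first few points are averaged over however many samples exist so far.
    """
    smoothed = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            # Drop the sample that just left the window.
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed


def pareto_frontier(points):
    """Return the (size, reward) points that are not dominated.

    Assumption: smaller size and higher reward are both preferred.
    """
    frontier = []
    best_reward = float("-inf")
    # Sort by size ascending; for equal sizes, highest reward first.
    for size, reward in sorted(points, key=lambda p: (p[0], -p[1])):
        if reward > best_reward:
            frontier.append((size, reward))
            best_reward = reward
    return frontier
```

With made-up inputs, `moving_average([1, 2, 3, 4], window=2)` gives `[1.0, 1.5, 2.5, 3.5]`, and `pareto_frontier([(4, 0.5), (2, 0.4), (8, 0.45), (4, 0.6), (16, 0.7)])` keeps `(2, 0.4)`, `(4, 0.6)`, and `(16, 0.7)`.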