RL in Name Only? Analyzing the Structural Assumptions in RL post-training for LLMs

Soumya Rani Samineni, Durgesh Kalwar, Karthik Valmeekam, Kaya Stechly, Subbarao Kambhampati

TL;DR

This work critically examines RL-based post-training for LLMs under the prevalent LLM-MDP formulation, showing that treating states as token sequences and distributing terminal rewards uniformly effectively collapses RL to outcome-driven supervised fine-tuning. The authors provide theoretical decompositions and empirical evidence on GSM8K and Countdown demonstrating that GRPO behaves similarly to filtered iterative supervised fine-tuning (Filtered-ISFT) when both positive and negative samples are used, and that the observed longer outputs stem from credit-distribution biases rather than genuine reasoning improvements. They argue that the perceived benefits of RL are artifacts of the degenerate MDP and advocate for richer, more expressive RL formulations (e.g., 2-LLM CoT-style architectures) to unlock the true potential of RL in guiding LLM reasoning. The findings urge researchers to rethink reward structures and state representations rather than focusing on superficial fixes like length penalties.

Abstract

Reinforcement learning-based post-training of large language models (LLMs) has recently gained attention, particularly following the release of DeepSeek R1, which applied GRPO for fine-tuning. Amid the growing hype around improved reasoning abilities attributed to RL post-training, we critically examine the formulation and assumptions underlying these methods. We start by highlighting the popular structural assumptions made in modeling LLM training as a Markov Decision Process (MDP), and show how they lead to a degenerate MDP that doesn't quite need the RL/GRPO apparatus. The two critical structural assumptions are (1) making the MDP states just a concatenation of the actions, with states becoming the context window and actions becoming the tokens in LLMs, and (2) splitting the reward of a state-action trajectory uniformly across the trajectory. Through a comprehensive analysis, we demonstrate that these simplifying assumptions make the approach effectively equivalent to outcome-driven supervised learning. Our experiments on benchmarks including GSM8K and Countdown using Qwen-2.5 base models show that iterative supervised fine-tuning, incorporating both positive and negative samples, achieves performance comparable to GRPO-based training. We also argue that the structural assumptions indirectly incentivize the RL to generate longer sequences of intermediate tokens, which in turn feeds into the narrative of "RL generating longer thinking traces." While RL may well be a very useful technique for improving the reasoning abilities of LLMs, our analysis shows that the simplistic structural assumptions made in modeling the underlying MDP render the popular LLM RL frameworks and their interpretations questionable.
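
The two structural assumptions named in the abstract can be illustrated with a small toy sketch. This is not the authors' code. It assumes a GRPO-style weighting in which each response's group-normalized advantage is spread uniformly over its tokens and scaled by 1/length; the function names, the toy token sequences, the binary rewards, and the use of the population standard deviation are all hypothetical choices made for illustration.

```python
from statistics import mean, pstdev

def transition(state, action):
    # Assumption (1): the next state is just the context window plus the new token.
    return state + (action,)

def group_advantages(rewards):
    # Group-normalized advantage: (reward - group mean) / group std.
    # Population std is an illustrative choice, not taken from the paper.
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

def per_token_weights(responses, rewards):
    # Assumption (2): the terminal reward signal is split uniformly across
    # the trajectory, so every token in a response gets the same weight.
    advantages = group_advantages(rewards)
    return [[a / len(resp)] * len(resp) for resp, a in zip(responses, advantages)]

# Rolling out a response only ever appends tokens to the prompt.
prompt = ("3", "+", "4", "=")
state = prompt
for token in ("7", "<eos>"):
    state = transition(state, token)

# Toy group of four sampled responses (hypothetical tokens), binary rewards.
responses = [
    ("a",) * 5,    # correct, short
    ("b",) * 5,    # incorrect, short
    ("c",) * 10,   # incorrect, long
    ("d",) * 5,    # correct, short
]
rewards = [1.0, 0.0, 0.0, 1.0]
weights = per_token_weights(responses, rewards)

for resp, w in zip(responses, weights):
    print(len(resp), w[0])
```

In this toy setting every token of a response carries the same scalar weight, so the update amounts to a signed, sequence-level weighting of the log-likelihood of whole responses. The long incorrect response also receives a smaller per-token penalty than the short incorrect one, which is one way to picture the length incentive the abstract describes.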

Paper Structure

This paper contains 18 sections, 13 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Left: Number of correct and incorrect responses. Right: Average response lengths for correct and incorrect responses, on the Countdown test dataset at each evaluation step during post-training of the Qwen-2.5-1.5B base model using GRPO and a variant of the Filtered-ISFT algorithm.
  • Figure 2: Left: Average response lengths for correct and incorrect responses. Right: Number of correct and incorrect responses, on the Countdown test dataset at each evaluation step during post-training of the Qwen-2.5-1.5B base model using the GRPO algorithm.