SofT-GRPO: Surpassing Discrete-Token LLM Reinforcement Learning via Gumbel-Reparameterized Soft-Thinking Policy Optimization

Zhi Zheng, Wee Sun Lee

TL;DR

SofT-GRPO tackles the challenge of reinforcing soft-thinking in LLMs with reinforcement learning by injecting controllable stochasticity directly into logits through Gumbel noise and enabling gradient-based updates via Gumbel reparameterization. The method uses a Gumbel-Softmax-based group rollout to explore diverse but valid soft-thinking paths, while a reparameterized loss provides precise attribution of rewards to the output probabilities. Across 1.5B–7B LLMs on five numerical benchmarks and additional out-of-domain tasks, SofT-GRPO achieves a modest boost over discrete-token GRPO on Pass@1 and substantial gains on Pass@32, with further improvements seen when combined with majority voting. These results demonstrate that the soft-thinking paradigm can be effectively strengthened with specialized RLVR techniques, offering practical benefits in accuracy and token efficiency and suggesting broader applicability to other modalities and tasks.

Abstract

The soft-thinking paradigm for Large Language Model (LLM) reasoning can outperform the conventional discrete-token Chain-of-Thought (CoT) reasoning in some scenarios, underscoring its research and application value. However, while the discrete-token CoT reasoning pattern can be reinforced through policy optimization algorithms such as group relative policy optimization (GRPO), extending the soft-thinking pattern with Reinforcement Learning (RL) remains challenging. This difficulty stems from the complexities of injecting stochasticity into soft-thinking tokens and updating soft-thinking policies accordingly. As a result, previous attempts to combine soft-thinking with GRPO typically underperform their discrete-token GRPO counterparts. To fully unlock the potential of soft-thinking, this paper presents a novel policy optimization algorithm, SofT-GRPO, to reinforce LLMs under the soft-thinking reasoning pattern. SofT-GRPO injects Gumbel noise into logits, employs the Gumbel-Softmax technique to avoid soft-thinking tokens outside the pre-trained embedding space, and leverages the reparameterization trick in the policy gradient. We conduct experiments across base LLMs ranging from 1.5B to 7B parameters, and the results demonstrate that SofT-GRPO enables soft-thinking LLMs to slightly outperform discrete-token GRPO on Pass@1 (+0.13% on average accuracy), while exhibiting a substantial uplift on Pass@32 (+2.19% on average accuracy). Code and weights are available at https://github.com/zz1358m/SofT-GRPO-master
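To make the noise-injection step in the abstract concrete, here is a minimal sketch of producing one stochastic soft-thinking input from a toy vocabulary. It is an illustration under stated assumptions, not the authors' implementation: the function names, the three-token vocabulary, and the 2-dimensional embeddings are hypothetical, and the temperature value 0.1 is borrowed from the Gumbel-Softmax setting mentioned in the ablation figure caption. Because the Gumbel-Softmax weights are nonnegative and sum to one, the resulting input is a convex combination of existing token embeddings, which is one way to read "avoid soft-thinking tokens outside the pre-trained embedding space".

```python
import math
import random


def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample via the inverse CDF: -log(-log(U))."""
    u = rng.random()
    while u <= 0.0:  # random() may return exactly 0.0, where log is undefined
        u = rng.random()
    return -math.log(-math.log(u))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax_soft_token(logits, embeddings, tau_g, rng):
    """Return (weights, soft_embedding) for one soft-thinking step.

    Assumption: noise is added to the raw logits, then a softmax with
    temperature tau_g turns the noisy logits into mixing weights.
    """
    noisy = [x + sample_gumbel(rng) for x in logits]
    weights = softmax([g / tau_g for g in noisy])
    dim = len(embeddings[0])
    # Weighted average of token embeddings (a convex combination).
    soft = [sum(w * e[d] for w, e in zip(weights, embeddings)) for d in range(dim)]
    return weights, soft


rng = random.Random(0)
toy_logits = [2.0, 1.0, 0.1]                            # hypothetical 3-token vocabulary
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # hypothetical 2-d embeddings
weights, soft_embedding = gumbel_softmax_soft_token(toy_logits, toy_embeddings, 0.1, rng)
print(weights, soft_embedding)
```

Calling the function repeatedly with the same logits yields different mixing weights, which is the source of diversity across a group of rollouts; a lower `tau_g` pushes the weights closer to a one-hot choice.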

Paper Structure

This paper contains 42 sections, 1 theorem, 23 equations, 6 figures, 7 tables.

Key Result

Theorem 3.1

Let $(p_1, \dots, p_n)$ be nonnegative, and $\epsilon_1, \dots, \epsilon_n$ independent samples from $\mathrm{Gumbel}(0,1)$ [maddison2016concrete],

Figures (6)

  • Figure 1: The soft-thinking pattern (b) passes the expectation of embeddings to the next LLM step [zhang2025soft], which can surpass the conventional discrete-token CoT (a) without any fine-tuning. Employing the GRPO algorithm (c) boosts the performance of discrete-token CoT, but existing attempts (d) to apply RLVR to soft-thinking yield inferior performance. The proposed SofT-GRPO (e) provides the first valid RLVR algorithm, which can outperform the discrete-token CoT with GRPO.
  • Figure 2: The pipeline of the proposed SofT-GRPO algorithm. In training with a query $\boldsymbol{Q}$, SofT-GRPO first generates a group of $G$ soft-thinking reasoning paths with Gumbel noise and the Gumbel-Softmax technique [jang2016categorical]. The values $g'_i$ and $y'_i$ are passed on for the subsequent loss calculation. Then, we reconstruct the soft-thinking input. Finally, we update the soft-thinking policy with the off-policy REINFORCE [williams1992simple] algorithm, optimizing the soft-thinking reasoning tokens with Gumbel reparameterization.
  • Figure 3: Smoothed training or validation curves of ablation studies (the dashed background contains the actual data points). (a) discusses the setting of adding Gumbel noise in SofT-GRPO. (b) discusses the setting of top-p=0.95 and the Gumbel-Softmax temperature $\tau_g=0.1$ in SofT-GRPO.
  • Figure 4: Token consumption curve on LLaMA-3.2-3B-Instruct Base LLM during training.
  • Figure 5: Running discrete-token CoT methods (GRPO and No-finetune) with more temperature options on DeepSeek-R1-Distill-Qwen-1.5B Base LLM. Pass@k represents the pass rate within at most k runs, and Pass@1 is additionally averaged from 32 runs. Averages are computed over the five datasets in Table \ref{maindata}.
  • ...and 1 more figure
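The Pass@k metric described in the Figure 5 caption can be sketched as follows. This is one plausible reading, not the paper's evaluation code: it assumes each problem has a list of boolean outcomes (one per run), that "within at most k runs" means at least one success among the first k recorded runs, and that the averaged Pass@1 is the mean per-run success rate. The function names and the toy data are hypothetical.

```python
def pass_at_k(results, k):
    """Fraction of problems solved by at least one of the first k runs."""
    return sum(any(runs[:k]) for runs in results) / len(results)


def mean_pass_at_1(results):
    """Pass@1 averaged over all recorded runs of each problem."""
    return sum(sum(runs) / len(runs) for runs in results) / len(results)


# Toy outcomes for three problems with four runs each (illustrative only).
toy_results = [
    [True, False, False, False],
    [False, False, True, False],
    [False, False, False, False],
]
print(pass_at_k(toy_results, 1), pass_at_k(toy_results, 4), mean_pass_at_1(toy_results))
```

Under this reading, Pass@k can only grow with k, while the averaged Pass@1 reflects how reliably a single run succeeds.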

Theorems & Definitions (2)

  • Theorem 3.1: Gumbel-max Trick
  • proof
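The statement of Theorem 3.1 is cut off in this listing, so the following is a numerical check of the Gumbel-max trick in its standard textbook form, which is assumed rather than copied from the paper: for nonnegative $p_i$ (not all zero) and independent $\epsilon_i \sim \mathrm{Gumbel}(0,1)$, the index $\arg\max_i (\log p_i + \epsilon_i)$ is distributed as a categorical variable with probabilities $p_i / \sum_j p_j$. The function names and the example weights are hypothetical.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample via the inverse CDF: -log(-log(U))."""
    u = rng.random()
    while u <= 0.0:  # random() may return exactly 0.0, where log is undefined
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_max_sample(p, rng):
    """Return argmax_i (log p_i + eps_i); assumes at least one p_i > 0."""
    best_index, best_value = -1, float("-inf")
    for i, p_i in enumerate(p):
        if p_i <= 0.0:
            continue  # log(0) = -inf can never win the argmax
        value = math.log(p_i) + sample_gumbel(rng)
        if value > best_value:
            best_index, best_value = i, value
    return best_index


def empirical_frequencies(p, num_samples, seed=0):
    """Estimate how often each index is selected by the Gumbel-max trick."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    for _ in range(num_samples):
        counts[gumbel_max_sample(p, rng)] += 1
    return [c / num_samples for c in counts]


weights = [1.0, 2.0, 0.0, 5.0]  # unnormalized, nonnegative
target = [w / sum(weights) for w in weights]
print(target)
print(empirical_frequencies(weights, 20000))
```

The empirical frequencies should track the normalized weights up to sampling error, and the zero-weight index is never selected.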