SofT-GRPO: Surpassing Discrete-Token LLM Reinforcement Learning via Gumbel-Reparameterized Soft-Thinking Policy Optimization
Zhi Zheng, Wee Sun Lee
TL;DR
SofT-GRPO tackles the challenge of training soft-thinking LLMs with reinforcement learning by injecting controllable stochasticity directly into logits through Gumbel noise and enabling gradient-based updates via Gumbel reparameterization. The method uses a Gumbel-Softmax-based group rollout to explore diverse but valid soft-thinking paths, while a reparameterized loss attributes rewards precisely to the output probabilities. Across 1.5B–7B LLMs on five numerical benchmarks and additional out-of-domain tasks, SofT-GRPO achieves a modest gain over discrete-token GRPO on Pass@1 and a substantial gain on Pass@32, with further improvements when combined with majority voting. These results indicate that the soft-thinking paradigm can be effectively strengthened with specialized RLVR techniques, offering practical benefits in accuracy and token efficiency and suggesting broader applicability to other modalities and tasks.
Abstract
The soft-thinking paradigm for Large Language Model (LLM) reasoning can outperform conventional discrete-token Chain-of-Thought (CoT) reasoning in some scenarios, underscoring its research and application value. However, while the discrete-token CoT reasoning pattern can be reinforced through policy optimization algorithms such as group relative policy optimization (GRPO), reinforcing the soft-thinking pattern with Reinforcement Learning (RL) remains challenging. This difficulty stems from the complexity of injecting stochasticity into soft-thinking tokens and of updating soft-thinking policies accordingly. As a result, previous attempts to combine soft-thinking with GRPO typically underperform their discrete-token GRPO counterparts. To fully unlock the potential of soft-thinking, this paper presents a novel policy optimization algorithm, SofT-GRPO, to reinforce LLMs under the soft-thinking reasoning pattern. SofT-GRPO injects Gumbel noise into logits, employs the Gumbel-Softmax technique to keep soft-thinking tokens inside the pre-trained embedding space, and leverages the reparameterization trick in the policy gradient. We conduct experiments across base LLMs ranging from 1.5B to 7B parameters, and the results demonstrate that SofT-GRPO enables soft-thinking LLMs to slightly outperform discrete-token GRPO on Pass@1 (+0.13% on average accuracy), while exhibiting a substantial uplift on Pass@32 (+2.19% on average accuracy). Code and weights are available at https://github.com/zz1358m/SofT-GRPO-master
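To make the Gumbel-Softmax step concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a soft-thinking token is formed as a probability-weighted mixture of token embeddings, that Gumbel(0, 1) noise is added to the logits before a temperature-scaled softmax, and that the temperature `tau` is a free hyperparameter. The function names `gumbel_softmax` and `soft_token` and the toy embedding table are hypothetical, chosen only for illustration.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Perturb logits with Gumbel(0, 1) noise, then apply a softmax at temperature tau.

    Illustrative assumption: the returned probability vector is what weights
    the token embeddings of a soft-thinking token.
    """
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) sample via inverse transform: g = -log(-log(u)), u ~ U(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
        g = -math.log(-math.log(u))
        noisy.append((logit + g) / tau)
    # Numerically stable softmax: subtract the maximum before exponentiating.
    m = max(noisy)
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def soft_token(probs, embeddings):
    """Mix embedding rows with weights probs (hypothetical helper).

    Because probs is non-negative and sums to 1, the result is a convex
    combination of the embedding rows and so stays inside their convex hull.
    """
    dim = len(embeddings[0])
    return [sum(p * row[d] for p, row in zip(probs, embeddings)) for d in range(dim)]
```

As a usage example, calling `gumbel_softmax([2.0, 0.5, -1.0], tau=0.5, rng=random.Random(0))` and passing the result to `soft_token` with a three-row embedding table yields a different mixture for each random seed, which is the source of rollout diversity in this sketch. The noise acts on the logits rather than on the mixed embedding, so every sample remains a valid mixture of pre-trained embeddings.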
