Training Large Language Models To Reason In Parallel With Global Forking Tokens
Sheng Jia, Xiao Wang, Shiva Prasad Kasiviswanathan
TL;DR
This work addresses the challenge of making parallel reasoning in large language models both diverse and accurate. The authors introduce Set Supervised Fine-Tuning (SSFT), which uses a set of global forking tokens to initiate multiple reasoning traces in parallel, together with a set-based loss that uses bipartite matching to align each forking token with a trace. This loss enforces permutation invariance and prevents the reasoning modes from collapsing. The approach yields emergent global forking tokens and consistent improvements in Pass@1 and Cons@k across reasoning benchmarks, outperforming both standard SFT and naive fine-tuning on multiple traces. By enabling coverage-aware training on distilled reasoning traces, SSFT offers a practical path to scalable parallel reasoning with improved robustness, and its Hungarian-matching step keeps the training loop computationally efficient. The method may also improve interpretability and reliability on reasoning-heavy tasks; natural next steps include scaling the set of forking tokens and evaluating on broader domains.
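A minimal sketch of the matching step follows. It assumes a k x k cost matrix in which cost[i][j] stands for the loss of reasoning trace j when it is conditioned on global forking token i (for example, its negative log-likelihood). The names match_forks_to_traces and set_loss are hypothetical, and averaging the matched costs is an assumed normalization. For clarity, the sketch searches all k! permutations exhaustively. This returns the same minimum-cost assignment that the Hungarian algorithm finds in polynomial time, but it is only practical for small k.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def match_forks_to_traces(cost: Sequence[Sequence[float]]) -> Tuple[List[int], float]:
    """Return the forking-token -> trace assignment with minimum total cost.

    cost[i][j] is a hypothetical per-pair loss, e.g. the negative
    log-likelihood of trace j when prefixed with forking token i.
    Exhaustive search gives the same optimum as the Hungarian algorithm,
    but only scales to small k.
    """
    k = len(cost)
    if k == 0 or any(len(row) != k for row in cost):
        raise ValueError("cost must be a non-empty square k x k matrix")
    best_perm: List[int] = list(range(k))
    best_total = float("inf")
    for perm in permutations(range(k)):
        # perm[i] is the index of the trace assigned to forking token i.
        total = sum(cost[i][perm[i]] for i in range(k))
        if total < best_total:
            best_total = total
            best_perm = list(perm)
    return best_perm, best_total


def set_loss(cost: Sequence[Sequence[float]]) -> float:
    """Set-level loss: mean cost of the matched (forking token, trace) pairs."""
    _perm, total = match_forks_to_traces(cost)
    return total / len(cost)


# Hypothetical costs for 3 forking tokens (rows) and 3 traces (columns).
cost = [[2.0, 0.5, 3.0],
        [0.25, 2.5, 2.0],
        [3.0, 2.0, 0.25]]
perm, total = match_forks_to_traces(cost)
print(perm, total)  # [1, 0, 2] 1.0
```

Because the loss depends only on the best assignment, reordering the traces leaves it unchanged, which is the permutation invariance a set-based loss relies on. In a training loop the assignment would presumably be computed without gradients, with only the matched costs backpropagated.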
Abstract
Although LLMs have demonstrated improved performance by scaling parallel test-time compute, doing so relies on generating reasoning paths that are both diverse and accurate. For challenging problems, the forking tokens that trigger diverse yet correct reasoning modes are typically deep in the sampling tree. Consequently, common strategies to encourage diversity, such as temperature scaling, encounter a worsened trade-off between diversity and accuracy. Motivated by this challenge, we treat parallel reasoning as a set-of-next-token-prediction problem, and incorporate a set-based global loss into Supervised Fine-Tuning (SFT) using self-supervised bipartite matching between our global forking tokens and unique reasoning traces. We observe that, while naive fine-tuning with multiple reasoning traces collapses these unique reasoning modes, our proposed method, Set Supervised Fine-Tuning (SSFT), preserves these modes and produces emergent global forking tokens. Experiments on multiple reasoning benchmarks show that our SSFT consistently outperforms SFT under both Pass@1 and Cons@k metrics.
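The two metrics named in the abstract can be illustrated with a short sketch. It assumes that each problem has k sampled final answers already extracted and normalized to strings, that Pass@1 is estimated as the fraction of correct samples averaged over problems, and that Cons@k is majority-vote accuracy over the k samples with ties broken by first occurrence. These definitions and the function names pass_at_1 and cons_at_k are assumptions for illustration, not the paper's evaluation code.

```python
from collections import Counter
from typing import Sequence


def pass_at_1(samples: Sequence[Sequence[str]], answers: Sequence[str]) -> float:
    """Fraction of correct samples per problem, averaged over problems."""
    per_problem = [
        sum(out == gold for out in outs) / len(outs)
        for outs, gold in zip(samples, answers)
    ]
    return sum(per_problem) / len(per_problem)


def cons_at_k(samples: Sequence[Sequence[str]], answers: Sequence[str]) -> float:
    """Majority-vote accuracy over the k samples of each problem.

    Counter.most_common orders equal counts by first occurrence,
    so ties go to the answer that was sampled first.
    """
    solved = 0
    for outs, gold in zip(samples, answers):
        majority, _count = Counter(outs).most_common(1)[0]
        solved += int(majority == gold)
    return solved / len(answers)


# Hypothetical example: 3 problems, k = 4 sampled answers each.
samples = [["42", "42", "17", "42"], ["3", "5", "5", "7"], ["1", "2", "3", "4"]]
answers = ["42", "3", "4"]
print(pass_at_1(samples, answers))  # (0.75 + 0.25 + 0.25) / 3
print(cons_at_k(samples, answers))  # only problem 1 is solved by majority vote
```

Pass@1 measures how often an individual trace is correct, while Cons@k rewards a model whose parallel traces agree on the correct answer.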
