Mode-Conditioning Unlocks Superior Test-Time Scaling

Chen Henry Wu, Sachin Goyal, Aditi Raghunathan

TL;DR

The paper addresses diversity collapse in parallel test-time sampling for large language models and introduces mode-conditioning (ModC), which allocates inference compute across distinct reasoning modes. It presents two practical training instantiations (specialist models and mode-specific prefixes) and an automated mode-discovery approach based on gradient clustering. Across controlled Countdown tasks, large-scale math reasoning (OpenThoughts, NuminaMath), and reinforcement learning, ModC yields up to a 4× gain in inference efficiency and Pass@$k$ gains of up to 10% (on NuminaMath, with gradient clustering), while also raising the maximum attainable Pass@$k$. The work demonstrates that standard training underutilizes data diversity and offers a simple, scalable remedy for unlocking robust test-time scaling.

Abstract

Parallel sampling promises substantial gains in test-time scaling, but its effectiveness is sharply limited by diversity collapse, where models concentrate on a few modes and repeated samples produce the same mistakes. We propose the mode-conditioning (ModC) framework, which explicitly allocates test-time compute across reasoning modes using either specialist models or mode-specific prefixes. ModC consistently improves scaling across controlled graph-search tasks and large-scale reasoning benchmarks, spanning model families and sizes from 0.5B to 7B. On OpenThoughts, fine-tuning Qwen2.5-7B with ModC achieves a 4x efficiency gain over standard training while also improving the maximum attainable Pass@k. We further show that gradient clustering enables ModC without explicit mode labels, yielding up to 10% gains on datasets such as NuminaMath. Finally, we show that ModC improves reinforcement learning (RL) and can further boost diversity-inducing RL methods. These results demonstrate that standard training underutilizes the diversity in data, and that ModC provides a simple, effective remedy for unlocking the full benefits of diversity in test-time scaling.
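
To make the allocation idea concrete, here is a minimal sketch of splitting a sampling budget evenly across modes, together with a toy calculation of why that can help under diversity collapse. This is an illustration under stated assumptions, not the authors' implementation: the function names (`allocate_budget`, `pass_at_k_standard`, `pass_at_k_modc`), the mode names, and the per-sample success probabilities are all hypothetical, and samples are assumed independent.

```python
from typing import Dict, Sequence


def allocate_budget(k: int, modes: Sequence[str]) -> Dict[str, int]:
    """Split k samples as evenly as possible across the given modes."""
    if not modes:
        raise ValueError("need at least one mode")
    base, extra = divmod(k, len(modes))
    # The first `extra` modes each receive one additional sample.
    return {m: base + (1 if i < extra else 0) for i, m in enumerate(modes)}


def pass_at_k_standard(k: int, weights: Dict[str, float],
                       success: Dict[str, float]) -> float:
    """Toy model of standard sampling: each sample picks mode m with
    probability weights[m], then succeeds with probability success[m]."""
    p = sum(weights[m] * success[m] for m in weights)
    return 1.0 - (1.0 - p) ** k


def pass_at_k_modc(k: int, success: Dict[str, float]) -> float:
    """Toy model of mode-conditioned sampling: the budget is split evenly,
    and the problem is solved if any sample in any mode succeeds."""
    fail = 1.0
    for mode, n in allocate_budget(k, list(success)).items():
        fail *= (1.0 - success[mode]) ** n
    return 1.0 - fail


# Hypothetical problem: DFS never solves it, BFS solves it 10% of the time.
success = {"dfs": 0.0, "bfs": 0.1}
# Hypothetical collapsed model: it always commits to DFS.
collapsed = {"dfs": 1.0, "bfs": 0.0}

print(allocate_budget(8, ["dfs", "bfs"]))         # {'dfs': 4, 'bfs': 4}
print(pass_at_k_standard(8, collapsed, success))  # 0.0
print(round(pass_at_k_modc(8, success), 4))       # 0.3439
```

In this toy setting the collapsed sampler fails on every attempt because it never tries the mode that works, while the balanced split spends half the budget on the working mode and succeeds with probability $1 - 0.9^4 \approx 0.34$.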

Paper Structure

This paper contains 23 sections, 7 equations, 10 figures.

Figures (10)

  • Figure 1: Mode-conditioning for test-time scaling. Modern LLMs often collapse to a single strategy, making Pass@$k$ scaling suboptimal: if the chosen strategy is wrong, every attempt fails. (Left) In a controlled graph task solvable by DFS or BFS, models trained on both still often commit to just one. To address this, we introduce mode-conditioning (ModC), which explicitly allocates test-time compute across modes. We study two training methods that enable this: separate models, or a single model with mode-specific prefixes. (Right) 4$\times$ efficiency gains with ModC training. We apply ModC to long chain-of-thought reasoning distillation on the OpenThoughts dataset. With ModC, the model matches the Pass@1024 of standard training using only $k=256$ samples, a $\sim$4$\times$ improvement in inference efficiency. ModC also improves the maximum attainable Pass@$k$, pushing the frontier of test-time scaling.
  • Figure 2: Standard training fails to balance diverse modes per problem under repeated sampling. This issue does not go away with balanced training data. Instead, ModC explicitly targets and successfully achieves balanced test-time compute allocation.
  • Figure 3: Balanced test-time allocation improves scaling. (a) On the natural test set of Countdown, balanced test-time allocation with ModC shows consistent improvements as $k$ increases. (b) On the adversarial test set, where each problem is challenging for one algorithm (oracle DFS or BFS), the gains from enforced mode diversity are even more pronounced.
  • Figure 4: Ablation studies on Countdown. ModC with random partitioning sometimes shows gains but does not outperform ModC with the DFS/BFS partition. Balancing the DFS/BFS distribution in the training data shows no gains over standard training.
  • Figure 5: ModC improves short CoT reasoning. Pass@$k$ on MATH500. Naively mixing teacher data underperforms the single-teacher baseline, while ModC shows consistent gains. ModC with prefixes generally works better than ModC with separate models, underscoring the importance of sharing knowledge across modes (teacher strategies) in math reasoning.
  • ...and 5 more figures
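
The captions above report results as Pass@$k$. The source does not say which estimator the authors use; the sketch below shows the widely used unbiased estimator, which computes Pass@$k$ from $n \ge k$ samples of which $c$ are correct as $1 - \binom{n-c}{k} / \binom{n}{k}$. The function name `pass_at_k` and the example counts are assumptions for illustration only; the 1024 and 256 sample budgets are the ones quoted in the Figure 1 caption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples, c of which are correct.

    Equals the probability that a uniformly random size-k subset of the
    n samples contains at least one correct sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # in which case every size-k subset contains a correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 4 samples drawn, 1 of them correct.
print(pass_at_k(4, 1, 1))  # 0.25
print(pass_at_k(4, 1, 2))  # 0.5

# Efficiency ratio from the Figure 1 caption: ModC at k=256 matches
# standard training at k=1024.
baseline_k, modc_k = 1024, 256
print(baseline_k / modc_k)  # 4.0
```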