DiSCTT: Consensus-Guided Self-Curriculum for Efficient Test-Time Adaptation in Reasoning

Mohammad Mahdi Moradi, Sudhir Mudur

TL;DR

DiSCTT is a difficulty-aware, consensus-guided self-curriculum framework that dynamically allocates test-time optimization strategies based on instance-level epistemic uncertainty, estimated from agreement among sampled reasoning trajectories. It yields more stable, efficient, and effective test-time adaptation for reasoning models.

Abstract

Test-time adaptation offers a promising avenue for improving reasoning performance in large language models without additional supervision, but existing approaches often apply a uniform optimization objective across all inputs, leading to inefficient or unstable adaptation on heterogeneous reasoning problems. We propose DiSCTT, a difficulty-aware, consensus-guided self-curriculum framework that dynamically allocates test-time optimization strategies based on instance-level epistemic uncertainty estimated from agreement among sampled reasoning trajectories. Inputs with high consensus are consolidated via supervised fine-tuning using majority-agreed solutions as pseudo-labels, while low-consensus inputs are optimized via reinforcement learning with a consensus-regularized objective that encourages diversity under relevance constraints. Across a broad suite of mathematical and general reasoning benchmarks, DiSCTT consistently outperforms strong test-time adaptation baselines, achieving higher accuracy with reduced variance and substantially lower computation and wall-clock training times. These results demonstrate that explicitly accounting for instance difficulty and uncertainty enables more stable, efficient, and effective test-time adaptation for reasoning models.
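
The routing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `consensus_route`, the majority-fraction definition of consensus, and the `threshold` value of 0.5 are all assumptions made for illustration, since the abstract does not specify how agreement is scored or where the cutoff lies.

```python
from collections import Counter


def consensus_route(answers, threshold=0.5):
    """Route one input based on agreement among sampled final answers.

    `answers` holds the final answer extracted from each sampled reasoning
    trajectory. `threshold` is a hypothetical cutoff, not a value from the paper.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")

    # Consensus is taken here as the fraction of samples agreeing with the
    # most common answer (an assumed proxy for epistemic uncertainty).
    majority_answer, majority_count = Counter(answers).most_common(1)[0]
    consensus = majority_count / len(answers)

    if consensus >= threshold:
        # High consensus: the majority answer serves as the SFT pseudo-label.
        return {"route": "sft", "pseudo_label": majority_answer, "consensus": consensus}

    # Low consensus: no reliable pseudo-label, so the input goes to RL.
    return {"route": "rl", "pseudo_label": None, "consensus": consensus}


easy = consensus_route(["42", "42", "42", "7"])   # 3 of 4 samples agree
hard = consensus_route(["1", "2", "3", "4"])      # no two samples agree
print(easy["route"], easy["pseudo_label"])
print(hard["route"], hard["pseudo_label"])
```

In this sketch an input whose samples mostly agree is sent to supervised fine-tuning with the majority answer as its label, while an input with scattered answers is sent to reinforcement learning. Re-running the function as the model changes would move inputs between the two routes over time.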

Paper Structure

This paper contains 19 sections, 12 equations, 5 figures, 5 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Accuracy on MATH-500 over test-time training updates for different adaptation strategies. Uniform test-time training baselines show early saturation or unstable gains, whereas DiSCTT achieves more stable and sustained improvements through difficulty-aware routing.
  • Figure 2: AMC accuracy for different adaptation strategies. Difficulty-aware routing and correctness-gated exploration improve convergence stability and prevent the performance collapse observed under uniform RL.
  • Figure 3: Overview of DiSCTT. For each input, multiple reasoning completions are sampled and evaluated for consensus. High-consensus inputs are routed to supervised fine-tuning using majority-agreed solutions as pseudo-labels, while low-consensus inputs are optimized via reinforcement learning. This difficulty-aware bifurcation is periodically recomputed, yielding a self-evolving test-time curriculum.
  • Figure 4: Out-of-distribution (OOD) reasoning performance under test-time adaptation. DiSCTT consistently outperforms baselines, indicating improved robustness to distribution shift without sacrificing in-distribution performance.
  • Figure 5: Difficulty-level accuracy dynamics on MATH-500 under different training paradigms. We compare (a) SFT-only, (b) RL-only (GRPO), and (c) DiSCTT across five difficulty levels (L1–L5) over adaptation steps.