Test-time RL alignment exposes task familiarity artifacts in LLM benchmarks

Kun Wang, Reinhard Heckel

Abstract

Direct evaluation of LLMs on benchmarks can be misleading because comparatively strong performance may reflect task familiarity rather than capability. The train-before-test approach controls for task familiarity by giving each model task-relevant training before evaluation, originally through supervised fine-tuning (SFT). However, suitable training data is often hard to come by, and evaluation results vary with the data chosen. In this paper, we propose a two-stage test-time reinforcement learning (RL) alignment method for train-before-test. First, RL with a single sample provides an initial alignment of the model to the task format; second, test-time RL with a majority-voting reward aligns the model to the benchmark distribution. Our test-time RL alignment method aligns models about as well as SFT-based train-before-test, but without requiring a task-specific training set. On a domain-specific benchmark without training data, we show that direct evaluation underestimates base models, which perform substantially better once aligned, yielding a more faithful evaluation of their capabilities. Moreover, for reasoning tasks, the performance gap between fine-tuned models and their base models largely disappears after alignment, suggesting that many gains from RLVR/SFT reported in the literature are not a difference in reasoning capability, but rather artifacts of task familiarity.
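To make the majority-voting reward in the second stage concrete, the following is a minimal sketch, assuming the usual test-time RL setup in which several answers are sampled per unlabeled test prompt and the most frequent answer serves as a pseudo-label. The abstract does not spell out the exact reward, so the function name `majority_vote_rewards`, the binary 0/1 reward, and the handling of unparseable answers are illustrative assumptions rather than the authors' implementation.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Compute pseudo-label rewards for one prompt (hypothetical helper).

    `answers` holds the final answers extracted from each sampled response;
    None marks a response whose answer could not be parsed (assumption).
    Returns (pseudo_label, rewards), with reward 1.0 for answers that agree
    with the majority answer and 0.0 otherwise.
    """
    valid = [a for a in answers if a is not None]
    if not valid:
        # No parseable answer: no pseudo-label, every sample gets zero reward.
        return None, [0.0] * len(answers)
    # The most frequent answer acts as the pseudo-label; ties go to the
    # answer that was seen first.
    pseudo_label, _ = Counter(valid).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


label, rewards = majority_vote_rewards(["42", "41", "42", None, "42"])
```

In an RL loop, these rewards would stand in for ground-truth correctness when updating the policy, which is why no labeled training set is needed.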

Paper Structure (31 sections, 12 figures, 2 tables)

Figures (12)

  • Figure 1: Fine-tuning gains are largely diminished after task alignment. Left: The reported performance reveals substantial accuracy gaps between base models and their fine-tuned variants. Right: After applying TTRA to all models, the base model's performance increases significantly, nearly matching that of the fine-tuned models, suggesting that many gains from RLVR/SFT reported in the literature are not a difference in reasoning capability, but rather artifacts of task familiarity.
  • Figure 2: Overview of the TTRA-based train-before-test evaluation pipeline.
  • Figure 3: TTRA harmonizes model rankings between GSM8K and MathQA. Left: Direct evaluation shows highly discordant model rankings across the two benchmarks. Right: After TTRA alignment, the rankings become substantially more consistent.
  • Figure 4: TTRA preserves existing ranking harmony between GSM8K and GSM-plus-mini. Left: Direct evaluation shows already consistent rankings for these similar-format tasks. Right: TTRA evaluation maintains this high consistency.
  • Figure 5: Stability analysis on MATH500. Our dataset-free TTRA method is robust to random choices of one-shot sample selection ($\Delta<1\%$). In contrast, SFT-based train-before-test is highly sensitive to the training data used.
  • ...and 7 more figures
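
The ranking agreement described in Figures 3 and 4 can be quantified with a rank correlation. The sketch below uses Kendall's tau (the tau-a variant) as one common choice; the captions do not state which statistic the paper uses, so this is an assumption. The function name `kendall_tau` and the model names and scores in the example are placeholders, not results from the paper.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two benchmarks' model rankings.

    `scores_a` and `scores_b` map model name -> accuracy on each benchmark.
    Returns a value in [-1, 1]: 1 means identical rankings, -1 means
    fully reversed rankings. Tied pairs count as neither.
    """
    models = sorted(set(scores_a) & set(scores_b))
    concordant = discordant = 0
    for m, n in combinations(models, 2):
        # Positive product: both benchmarks order the pair the same way.
        sign = (scores_a[m] - scores_a[n]) * (scores_b[m] - scores_b[n])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / pairs if pairs else 0.0


# Placeholder scores for three hypothetical models on two benchmarks.
bench_1 = {"model_x": 0.25, "model_y": 0.5, "model_z": 0.75}
bench_2 = {"model_x": 0.75, "model_y": 0.5, "model_z": 0.25}
tau_same = kendall_tau(bench_1, bench_1)
tau_reversed = kendall_tau(bench_1, bench_2)
```

Under this measure, "discordant" rankings under direct evaluation would show up as a low tau, and "harmonized" rankings after alignment as a tau closer to 1.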