PRISM: Demystifying Retention and Interaction in Mid-Training

Bharat Runwal, Ashish Agrawal, Anurag Roy, Rameswar Panda

Abstract

We present PRISM, a comprehensive empirical study of mid-training design choices for large language models. Through controlled experiments across seven base models spanning four families (Granite, LLaMA, Mistral, Nemotron-H), two architecture types (dense Transformer and attention-Mamba hybrid), and scales from 3B to 24B parameters, we show that mid-training on approximately 27B high-quality tokens yields consistent gains of +15 to +40 points on math, +5 to +12 points on code, and +6 to +13 points on science benchmarks while preserving general performance. The full PRISM to RL pipeline improves macro-average across six reasoning benchmarks from under 12 to 29-42 (a 3-4x improvement), whereas RL applied directly to most of the base models remains substantially less effective, with AIME scores near zero. Data composition matters most at mid-training, not RL: including science data during mid-training unlocks +17 to +28 point GPQA-Diamond gains during RL, while changing the RL mix produces less than 2 point differences. Mechanistically, mid-training densely restructures over 90% of model weights, while RL makes sparse, front-loaded refinements to approximately 5% of parameters. Representation analysis (CKA) confirms that RL consistently preserves mid-training's representational geometry (over 0.998 CKA) across architectures. Crucially, RL applies identical weight changes regardless of starting point, yet only succeeds on mid-trained models, consistent with mid-training placing the model in a configuration from which RL can effectively improve performance. Our results demonstrate that retention-aware mid-training is highly effective for reliable reasoning enhancement and provide practical guidance for designing robust mid-training pipelines.
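The CKA comparison mentioned in the abstract can be illustrated with a small sketch. The abstract does not say which CKA variant is used, so this assumes the common linear form computed on two activation matrices of shape (samples, features); the function name `linear_cka` is hypothetical and not taken from the paper.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_samples, n_features).

    Assumption: this is the standard linear CKA form, not necessarily the
    exact variant used in the paper.
    """
    # Center every feature column so constant offsets do not affect similarity.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acts_before = rng.normal(size=(64, 8))  # stand-in for mid-trained activations
    acts_after = acts_before + 0.01 * rng.normal(size=(64, 8))  # small perturbation
    print(linear_cka(acts_before, acts_after))
```

A value close to 1 means the two sets of activations share nearly the same geometry up to rotation and uniform scaling, which is the sense in which the abstract reports CKA above 0.998 between mid-trained and RL checkpoints.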

Paper Structure (102 sections, 11 equations, 30 figures, 23 tables)

Figures (30)

  • Figure 1: PRISM overview. Mid-training decisions are decomposed into their principal design axes, including retention of general and long-context abilities, domain interaction (math, code, science), benchmark selection, reinforcement learning compatibility, and scaling behavior. PRISM enables holistic evaluation of mid-training choices across model families at scale.
  • Figure 2: Mid-training data mixture configurations and per-source sampling percentages. The outer ring shows individual data sources; the inner ring groups them by domain category.
  • Figure 3: Long-context restoration pipeline. After PRISM mid-training degrades RULER@128k from 59.09 to 6.46, a linear merge (15% base + 85% mid-trained) followed by long-context extension recovers performance to 42.16 (full params) or 37.75 (attention-only).
  • Figure 4: $\textsc{PRISM} \to \text{RL}$: Granite-3.3-8B. RL training curves on the PRISM-mid-trained checkpoint using the unbalanced MCS mix. All benchmarks show consistent, monotonic improvements.
  • Figure 5: $\textsc{PRISM} \to \text{RL}$: Mistral-Small 24B. The largest model tested shows the strongest GPQA-Diamond gains (+27.95) and non-saturating code improvements.
  • ...and 25 more figures
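
The linear merge described in the Figure 3 caption (15% base + 85% mid-trained) can be sketched as a per-tensor weighted average. This is only an illustration under stated assumptions: checkpoints are represented here as plain dictionaries of NumPy arrays with matching names and shapes, and the helper `linear_merge` and its `base_weight` parameter are hypothetical names, not the authors' code.

```python
import numpy as np


def linear_merge(base: dict, mid: dict, base_weight: float = 0.15) -> dict:
    """Interpolate two checkpoints tensor by tensor.

    Assumption: both checkpoints are dicts mapping parameter names to arrays
    of identical shape. With base_weight=0.15 the result is
    0.15 * base + 0.85 * mid, matching the ratio in the Figure 3 caption.
    """
    if base.keys() != mid.keys():
        raise ValueError("checkpoints must contain the same parameter names")
    if not 0.0 <= base_weight <= 1.0:
        raise ValueError("base_weight must lie in [0, 1]")
    return {
        name: base_weight * base[name] + (1.0 - base_weight) * mid[name]
        for name in base
    }


if __name__ == "__main__":
    # Toy checkpoints standing in for the base and mid-trained models.
    base_ckpt = {"layer.weight": np.zeros((2, 2))}
    mid_ckpt = {"layer.weight": np.ones((2, 2))}
    merged = linear_merge(base_ckpt, mid_ckpt)
    print(merged["layer.weight"])
```

In the pipeline the caption describes, the merged checkpoint is then the starting point for long-context extension; that step is not shown here.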