Layer Importance for Mathematical Reasoning is Forged in Pre-Training and Invariant after Post-Training
Aadim Nepal, Safal Shrestha, Anubhav Shrestha, Minwu Kim, Jalal Naghiyev, Ravid Shwartz-Ziv, Keith Ross
TL;DR
We address whether post-training edits to mathematical reasoning reflect broad architectural changes or a small set of layer-level adaptations, using layer-wise zero ablation across two model families (Qwen-2.5-7B and Llama-3.1-8B) in base, Instruct, Distill, and RLVR variants. We show that mathematical reasoning depends on a compact set of critical layers whose identities persist after post-training, with removals causing large accuracy drops (up to eighty percent) while post-training does not alter which layers are critical. Representational analysis with $NMI$ reveals elbow-like reductions in cluster similarity around these critical layers, where token representations shift from surface syntactic groupings to semantically structured relations aligned with problem-solving steps. The findings imply that mathematical competence is forged during pre-training and preserved through post-training, suggesting targeted fine-tuning of a small subset of layers could improve efficiency and interpretability in downstream math tasks.
Abstract
Large language models improve at math after instruction tuning, reinforcement learning, or knowledge distillation. We ask whether these gains come from major changes in the transformer layers or from smaller adjustments that keep the original structure. Using layer-wise ablation on base and trained variants, we find that math reasoning depends on a few critical layers, which stay important across all post-training methods. Removing these layers reduces math accuracy by as much as 80%, whereas factual recall tasks only show relatively smaller drops. This suggests that specialized layers for mathematical tasks form during pre-training and remain stable afterward. As measured by Normalized Mutual Information (NMI), we find that near these critical layers, tokens drift from their original syntactic clusters toward representations aligned with tokens less syntactically related but potentially more useful for downstream task.
