A transformer architecture alteration to incentivise externalised reasoning

Elizabeth Pavlova; Mariia Koroliuk; Karthik Viswanathan; Cameron Tice; Edward James Young; Puria Radmard

Abstract

We propose an architectural change and a post-training pipeline for making LLMs more verbose reasoners by teaching a model to truncate forward passes early. We augment an existing transformer architecture with an early-exit mechanism at intermediate layers and train the model to exit at shallower layers when the next token can be predicted without deep computation. After a calibration stage, we use reinforcement learning to incentivise the model to exit as early as possible while maintaining task performance. We provide preliminary results to this effect for small reasoning models, showing that they learn to adaptively reduce computation across tokens. We predict that, applied at the right scale, our approach can minimise the excess computation that reasoning models have at their disposal to perform non-myopic planning in their internal activations, reserving it for difficult-to-predict tokens only.
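The trade-off described in the abstract, exiting as early as possible while maintaining task performance, can be illustrated with a toy reward. This is a minimal sketch under stated assumptions, not the paper's reward: the function name `combined_reward`, the parameter `early_exit_weight`, the linear "fraction of layers skipped" bonus, and the layer count of 36 in the usage note are all hypothetical choices made for illustration.

```python
def combined_reward(task_reward, exit_layers, num_layers, early_exit_weight=0.5):
    """Toy reward: task reward plus a bonus for layers skipped (assumed form).

    task_reward       -- scalar reward for the generated answer
    exit_layers       -- layer index at which each generated token exited
                         (num_layers means the full forward pass was used)
    num_layers        -- total number of transformer layers
    early_exit_weight -- hypothetical coefficient trading task reward against depth
    """
    if not exit_layers:
        return task_reward
    # Fraction of the stack that was skipped for each token.
    skipped = [(num_layers - layer) / num_layers for layer in exit_layers]
    # Average compute saved over the sequence.
    compute_saved = sum(skipped) / len(skipped)
    return task_reward + early_exit_weight * compute_saved
```

For example, with an illustrative 36-layer model, a sequence whose tokens all use the full depth receives only the task reward, while a sequence where half the tokens exit at layer 18 receives a small additional bonus.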

Paper Structure (10 sections, 6 equations, 4 figures, 1 table)

Introduction
Methods
Results
Self-distilled early exits match target distribution
Discussion
Conclusion
Weight calibration with SFT - more details
Target distribution verification
Distillation training verification
Scorer Prompt

Figures (4)

Figure 1: Overview of the early exit architecture. (A) Calibration via self-distillation: early-exit heads at intermediate layers sample exit decisions from learned probability distributions. Exit probabilities are calibrated against a teacher (the full-depth model), with targets determined by the KL divergence between intermediate and final layer logits, and are trained alongside token predictions (Appendix \ref{app:sft}). (B) Early exit incentivisation via RL: the model generates sequences using the learned exit mechanism, with task reward from output tokens and early exit reward from skipped layers. Layers above the exit point are skipped, and the frozen residual stream representation is passed directly to the unembedding layer.
Figure 2: Left: RL training dynamics on Qwen3-4B for the Theory of Mind task [Sclar2024]. Each panel shows individual runs (light lines), mean across runs (bold line with markers), and $\pm 1$ SE error bars. The dashed line indicates the pre-SFT base model performance. Step 0 represents the SFT-calibrated model before RL. Accuracy improves while average compute decreases, demonstrating that the model learns to exit earlier without sacrificing performance. Coherence remains stable throughout. Evaluated on 60 theory of mind prompts per step. Right: Token-level visualisation of early exit behaviour. Each token is coloured by the layer at which computation was terminated. The model adaptively varies computational depth, using fewer layers for predictable tokens and the full depth for more complex ones. We used $\lambda=1.5$ and $\beta=0.25$.
Figure 3: Distribution of coherence scores across models with different KL factors. As the KL factor increases, coherence distributions shift toward higher scores, approaching base model performance.
Figure 4: Distribution of exit layers. The model's learned exit distribution (orange bars) closely follows the target exit probability distribution from training data (grey bars), demonstrating effective learning of exit behaviour during Stage 1 calibration.
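
The calibration step in Figure 1(A), where the KL divergence between intermediate and final layer logits determines exit targets, might look roughly like the sketch below. The captions do not state the exact mapping from KL to a target probability, so the form `exp(-kl_factor * KL)`, the KL direction (teacher to intermediate), and the names `exit_targets` and `kl_factor` are assumptions made for illustration only.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions, clamped at zero."""
    kl = sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    return max(kl, 0.0)

def exit_targets(intermediate_logits, final_logits, kl_factor=1.0):
    """Assumed mapping: one exit target per intermediate layer, for one token.

    A layer whose logits already match the full-depth (teacher) logits gets a
    target near 1; a layer that disagrees strongly gets a target near 0.
    """
    teacher = softmax(final_logits)
    targets = []
    for logits in intermediate_logits:
        kl = kl_divergence(teacher, softmax(logits))
        targets.append(math.exp(-kl_factor * kl))
    return targets

# Usage: three hypothetical exit heads scored against the final-layer logits.
final = [2.0, 0.5, -1.0]
heads = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0], [-1.0, 0.5, 2.0]]
targets = exit_targets(heads, final, kl_factor=1.0)
```

Under this assumed mapping, a larger `kl_factor` makes early exits rarer, because the same disagreement with the teacher yields a lower target probability.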
