LLM Foundation Models: February 2026 Week 7
Feb 12 – Feb 18, 2026 · 164 papers analyzed · 3 breakthroughs
Summary
164 LLM papers analyzed. 3 breakthroughs: (1) 2602.11549 introduces Native Reasoning Training (NRT), achieving verifier-free reasoning via intrinsic rewards, +10 points over SFT on an 8B model; (2) 2602.12846 identifies RLVR's 'Normalization Squeeze', which extinguishes rare reasoning paths, and proposes ARTS with Flow Matching to recover long-tail solutions (74.6% BoN@16); (3) 2602.11863 formalizes in-context function learning as GP regression with steerable priors via post-training. Trends: verifier-free reasoning training is maturing, long-tail reasoning preservation is becoming critical, and in-context learning is getting principled GP-based analysis.
Key Takeaway
Week 7 addresses fundamental training limitations: verifier-free reasoning via intrinsic rewards, long-tail preservation against RLVR's filtering effect, and principled ICL analysis via GP regression with steerable priors.
Breakthroughs (3)
1. Native Reasoning Models: Training Language Models to Reason on Unverifiable Data
Why Novel: First comprehensive framework for training reasoning without external verifiers. Treats reasoning traces as latent variables with intrinsic rewards derived from predictive confidence.
Key Innovations:
- Formalizes reasoning as latent-variable optimization: maximize E_{z ~ p(z|x)}[log p(y|x, z)], where z is the latent reasoning trace
- Introduces aggregation-based reward schemes: Geometric Mean, Weighted Sum with inverse-probability weighting
- NRT-WS achieves 56.2% overall on Llama-3.1-8B vs 46.0% for SFT (+10.2 points)
- Prevents mode collapse: maintains high entropy, long traces (415 tokens vs 12 for baselines)
- Weighted schemes target model's uncertain tokens: largest gains on high-entropy predictions
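The two aggregation schemes can be sketched over per-token answer probabilities. The paper's exact weighting is not reproduced here, so the inverse-probability form below is an illustrative assumption (it reduces to a harmonic mean of the token probabilities):

```python
import math

def geometric_mean_reward(token_probs):
    """Geometric-mean aggregation: the trace is rewarded by the model's
    overall predictive confidence on the answer tokens."""
    return math.exp(sum(math.log(max(p, 1e-12)) for p in token_probs)
                    / len(token_probs))

def weighted_sum_reward(token_probs):
    """Weighted-sum aggregation with inverse-probability weights, so the
    model's uncertain (high-entropy) tokens dominate the reward.
    NOTE: this weighting is an assumed form, not the paper's exact one."""
    weights = [1.0 / max(p, 1e-12) for p in token_probs]
    return sum(w * p for w, p in zip(weights, token_probs)) / sum(weights)

# A single uncertain token drags the weighted reward well below the
# geometric mean, which is what makes confidence gains on high-entropy
# tokens the dominant learning signal:
print(round(geometric_mean_reward([0.9, 0.1]), 3),  # → 0.3
      round(weighted_sum_reward([0.9, 0.1]), 3))    # → 0.18
```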
Evidence:
- Intrinsic reward schemes with aggregation functions and token reward signals
- NRT-WS achieves 56.2% overall on Llama-3.1-8B across 9 benchmarks
- Training curves show NRT maintains entropy/length/quality while RLPR collapses
- Weighted schemes provide largest confidence gains on high-entropy tokens
Impact: Enables reasoning training on any SFT data without verifiers or expert traces. Opens path to scalable reasoning in unverifiable domains like creative writing, open-ended QA.
2. Amortized Reasoning Tree Search: Decoupling Proposal and Decision in Large Language Models
Why Novel: First formal analysis of RLVR's 'Normalization Squeeze' that extinguishes rare but valid reasoning paths. Proposes flow-based verifier to preserve long-tail reasoning without modifying base model.
Key Innovations:
- Identifies 'Normalization Squeeze': RLVR acts as high-pass filter, suppressing rare valid traces exponentially
- Tracks relative log-likelihood: rare traces decay systematically despite being correct
- ARTS decouples Proposer (frozen base) from Verifier (flow-based) for inference-time search
- Flow Matching preserves diversity where discriminative verifiers saturate
- Matches GRPO finetuning (74.6% vs 74.7% BoN@16) with <0.3% trainable params
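The Proposer/Verifier decoupling amounts to best-of-N search where only the verifier's scores decide. A minimal sketch with toy stand-ins (the real Proposer is the frozen base LLM and the Verifier is flow-based; neither is shown here):

```python
import random

def best_of_n(propose, verify, prompt, n=16, rng=None):
    """ARTS-style decoupled search: a frozen Proposer samples candidate
    reasoning traces; a separately trained Verifier picks the winner.
    The base model is never finetuned, so rare traces keep their mass."""
    rng = rng or random.Random(0)
    candidates = [propose(prompt, rng) for _ in range(n)]
    return max(candidates, key=verify)

# Toy stand-ins (assumptions, not the paper's models): the proposer emits
# a rare-but-valid trace 10% of the time; the verifier recognizes it.
def toy_propose(prompt, rng):
    return "rare_valid" if rng.random() < 0.1 else "common_invalid"

def toy_verify(trace):
    return 1.0 if trace == "rare_valid" else 0.0

print(best_of_n(toy_propose, toy_verify, "count the arrangements...", n=64))
```

Under RLVR the 10% trace would be squeezed out of the policy itself; here it only has to appear once among the N samples for the verifier to rescue it.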
Evidence:
- Visualization of Normalization Squeeze: rare traces enter extinction zone under RLVR
- ARTS achieves 74.6% BoN@16 matching GRPO's 74.7% with frozen base model
- On extinction set (GRPO=0%), ARTS achieves 6.9% vs PRM's 1.7%
- ARTS 3.2x gain on rare solutions, 8x on Counting & Probability
Impact: Explains why RLVR struggles with hard problems: not capability but filtering. Provides practical alternative via inference-time search that preserves model diversity.
3. In-Context Function Learning in Large Language Models
Why Novel: First principled framework for evaluating ICL on continuous functions via Gaussian Process regression. Shows LLMs achieve GP-like learning curves and priors can be steered via post-training.
Key Innovations:
- Casts ICL as GP regression with known priors: GP as lower bound, 1-NN as upper bound
- LLM learning curves approach GP baseline, well below 1-NN, with log-scaling in model size
- Inductive bias analysis via likelihood under different kernels reveals bias toward rough functions
- Bias shifts toward smoother functions in higher dimensions
- SFT and GRPO steer priors toward training data structure; GRPO offers better generalization
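The GP-vs-1-NN baselines can be reproduced in a few lines. This sketch samples a target function from a known RBF prior (kernel and length-scale are illustrative choices, not the paper's setup) and shows the matched GP tracking the function far better than 1-NN as the in-context example count grows:

```python
import numpy as np

def rbf(X, Y, ls=0.5):
    """Squared-exponential kernel on 1-D inputs."""
    return np.exp(-(X[:, None] - Y[None, :]) ** 2 / (2 * ls ** 2))

def gp_predict(Xtr, ytr, Xte, ls=0.5, jitter=1e-6):
    """GP posterior mean under the (matched) RBF prior: the lower bound."""
    K = rbf(Xtr, Xtr, ls) + jitter * np.eye(len(Xtr))
    return rbf(Xte, Xtr, ls) @ np.linalg.solve(K, ytr)

def nn1_predict(Xtr, ytr, Xte):
    """1-nearest-neighbour baseline: the upper bound."""
    return ytr[np.abs(Xte[:, None] - Xtr[None, :]).argmin(axis=1)]

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 200)
f = rng.multivariate_normal(np.zeros(200), rbf(X, X) + 1e-6 * np.eye(200))
Xte, yte = X[150:], f[150:]          # held-out query points
for n in (5, 20, 80):                # growing in-context example count
    mse_gp = np.mean((gp_predict(X[:n], f[:n], Xte) - yte) ** 2)
    mse_nn = np.mean((nn1_predict(X[:n], f[:n], Xte) - yte) ** 2)
    print(n, f"GP={mse_gp:.4f}", f"1-NN={mse_nn:.4f}")
```

An LLM's ICL learning curve would be measured the same way, with the model's in-context predictions substituted for `gp_predict`.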
Evidence:
- Framework overview: ICL evaluation, post-training, and inductive bias analysis via GP likelihoods
- Learning curves: Qwen-3 approaches GP baseline, 14B/32B show log-scaling improvement
- Base models biased toward rough kernels; predictions more likely under Matérn-1/2
- Post-training shifts likelihood toward training data kernel
Impact: Provides quantitative framework for understanding and steering ICL priors. Enables principled design of post-training for data-efficient continuous-function tasks.
Trends
Verifier-free reasoning training maturing: NRT achieves competitive performance on unverifiable data
Long-tail reasoning preservation becoming critical: RLVR's normalization squeeze identified and addressed
In-context learning getting principled GP-based analysis with steerable priors
Stateful LLMs emerging: learned memory management replacing hand-crafted RAG pipelines
Cognitive architectures informing LLM agent design: ACT-R-grounded depth adaptation
Notable Papers (5)
1. Prototype Transformer: Towards Language Model Architectures Interpretable by Design
ProtoT replaces attention with prototype-based mixer achieving O(n) complexity. Prototypes learn nameable concepts, enable targeted edits, and support time-scale analysis for transparent reasoning.
2. The Pensieve Paradigm: Stateful Language Models Mastering Their Own Context
StateLM learns to manage internal context via Pensieve-inspired external memory with read/write/delete ops. +52% on BrowseComp-Plus, +10-20% on chat memory via learned state management.
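The memory interface the model is trained against can be sketched minimally; the key-value layout below is an illustrative assumption, not the paper's actual memory format:

```python
class PensieveMemory:
    """Toy external memory exposing the read/write/delete ops a StateLM
    learns to call instead of relying on a hand-crafted RAG pipeline."""
    def __init__(self):
        self.slots = {}

    def write(self, key, note):
        self.slots[key] = note      # offload a fact out of the context window

    def read(self, key):
        return self.slots.get(key)  # pull a fact back in when needed

    def delete(self, key):
        self.slots.pop(key, None)   # prune stale state

mem = PensieveMemory()
mem.write("user_prefers", "concise answers")
print(mem.read("user_prefers"))  # → concise answers
```

The paper's contribution is training the model to decide *when* to issue these ops, not the store itself.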
3. Think Fast and Slow: Step-Level Cognitive Depth Adaptation for LLM Agents
CogRouter implements ACT-R-grounded step-level depth adaptation with four cognitive levels. CoSFT + CoPO training achieves SOTA on ALFWorld/ScienceWorld with substantially lower token usage.
4. ThinkRouter: Efficient Reasoning via Routing Thinking between Latent and Discrete Spaces
Dynamic routing between latent and discrete reasoning based on a confidence threshold. This training-free method yields +19.7pp Pass@1 on average by routing to the discrete space when latent confidence is low.
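The routing rule itself is a one-liner; `tau=0.7` below is an assumed placeholder for the paper's threshold value:

```python
def route_step(latent_confidence, tau=0.7):
    """Confidence-gated routing (sketch): stay in the cheap latent space
    while the model is confident; fall back to explicit discrete-token
    reasoning when confidence drops below the threshold."""
    return "latent" if latent_confidence >= tau else "discrete"

print(route_step(0.92), route_step(0.35))  # → latent discrete
```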
5. Towards On-Policy SFT: Distribution Discriminant Theory and its Applications in LLM Training
Distribution Discriminant Theory (DDT) quantifies data-model alignment. IDFT reweights the SFT loss and Hinted Decoding realigns the data, matching DPO/SimPO performance at SFT efficiency.
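One plausible reading of IDFT's reweighting (up-weighting examples the current model already assigns high likelihood, i.e. closer to on-policy) can be sketched as follows; the functional form is an illustrative guess, not the paper's:

```python
import math

def idft_weights(nlls, alpha=1.0):
    """Assumed IDFT-style reweighting: low-NLL (near-on-policy) examples
    get larger weights; weights are normalized to mean 1. The exponential
    form and alpha are assumptions for illustration."""
    lik = [math.exp(-alpha * nll) for nll in nlls]
    z = sum(lik)
    return [l * len(nlls) / z for l in lik]

def reweighted_sft_loss(nlls):
    """Mean SFT loss with per-example alignment weights applied."""
    return sum(w * nll for w, nll in zip(idft_weights(nlls), nlls)) / len(nlls)
```

With `nlls=[0.1, 2.0]`, the near-on-policy example receives most of the weight, damping the gradient contribution of badly misaligned data.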
Honorable Mentions
- To Mix or To Merge: Toward Multi-Domain Reinforcement Learning for Large Language Models
- Tiny Recursive Reasoning with Mamba-2 Attention Hybrid
- SD-MoE: Spectral Decomposition for Effective Expert Specialization
- SpiralFormer: Looped Transformers Can Learn Hierarchical Dependencies via Multi-Resolution Recursion
- Stop Unnecessary Reflection: Training LRMs for Efficient Reasoning with Adaptive Reflection and Length Coordinated Penalty