LLM Foundation Models: February 2026 Week 7
Feb 12 – Feb 18, 2026 · 164 papers analyzed · 3 breakthroughs
Summary
164 LLM papers analyzed. 3 breakthroughs: (1) 2602.11549 introduces Native Reasoning Training (NRT), achieving verifier-free reasoning via intrinsic rewards, +10 points over SFT on an 8B model; (2) 2602.12846 identifies RLVR's 'Normalization Squeeze', which extinguishes rare reasoning paths, and proposes ARTS with Flow Matching to recover long-tail solutions (74.6% BoN@16); (3) 2602.11863 formalizes in-context function learning as GP regression with steerable priors via post-training. Trends: verifier-free reasoning training is maturing, long-tail reasoning preservation is becoming critical, and in-context learning is getting principled GP-based analysis.
Key Takeaway
Week 7 addresses fundamental training limitations: verifier-free reasoning via intrinsic rewards, long-tail preservation against RLVR's filtering effect, and principled ICL analysis via GP regression with steerable priors.
Breakthroughs (3)
1. Native Reasoning Models: Training Language Models to Reason on Unverifiable Data
Why Novel: First comprehensive framework for training reasoning without external verifiers. Treats reasoning traces as latent variables with intrinsic rewards derived from predictive confidence.
Key Innovations:
- Formalizes reasoning as latent-variable optimization: maximize E_{z ~ p(z|x)}[log p(y|x, z)], where z is the latent reasoning trace
- Introduces aggregation-based reward schemes: Geometric Mean, Weighted Sum with inverse-probability weighting
- NRT-WS achieves 56.2% overall on Llama-3.1-8B vs 46.0% for SFT (+10.2 points)
- Prevents mode collapse: maintains high entropy, long traces (415 tokens vs 12 for baselines)
- Weighted schemes target model's uncertain tokens: largest gains on high-entropy predictions
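The two aggregation schemes can be sketched over per-token answer probabilities. The paper's exact weighting is not reproduced here, so the inverse-probability form below is an illustrative assumption (it reduces to a harmonic mean of the token probabilities):

```python
import math

def geometric_mean_reward(token_probs):
    """Geometric-mean aggregation: the trace is rewarded by the model's
    overall predictive confidence on the answer tokens."""
    return math.exp(sum(math.log(max(p, 1e-12)) for p in token_probs)
                    / len(token_probs))

def weighted_sum_reward(token_probs):
    """Weighted-sum aggregation with inverse-probability weights, so the
    model's uncertain (high-entropy) tokens dominate the reward.
    NOTE: this weighting is an assumed form, not the paper's exact one."""
    weights = [1.0 / max(p, 1e-12) for p in token_probs]
    return sum(w * p for w, p in zip(weights, token_probs)) / sum(weights)

# A single uncertain token drags the weighted reward well below the
# geometric mean, which is what makes confidence gains on high-entropy
# tokens the dominant learning signal:
print(round(geometric_mean_reward([0.9, 0.1]), 3),  # → 0.3
      round(weighted_sum_reward([0.9, 0.1]), 3))    # → 0.18
```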
Evidence:
- Intrinsic reward schemes with aggregation functions and token reward signals
- NRT-WS achieves 56.2% overall on Llama-3.1-8B across 9 benchmarks
- Training curves show NRT maintains entropy/length/quality while RLPR collapses
- Weighted schemes provide largest confidence gains on high-entropy tokens
Impact: Enables reasoning training on any SFT data without verifiers or expert traces. Opens path to scalable reasoning in unverifiable domains like creative writing, open-ended QA.
2. Amortized Reasoning Tree Search: Decoupling Proposal and Decision in Large Language Models
Why Novel: First formal analysis of RLVR's 'Normalization Squeeze' that extinguishes rare but valid reasoning paths. Proposes flow-based verifier to preserve long-tail reasoning without modifying base model.
Key Innovations:
- Identifies 'Normalization Squeeze': RLVR acts as high-pass filter, suppressing rare valid traces exponentially
- Tracks relative log-likelihood: rare traces decay systematically despite being correct
- ARTS decouples Proposer (frozen base) from Verifier (flow-based) for inference-time search
- Flow Matching preserves diversity where discriminative verifiers saturate
- Matches GRPO finetuning (74.6% vs 74.7% BoN@16) with <0.3% trainable params
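The Proposer/Verifier decoupling amounts to best-of-N search where only the verifier's scores decide. A minimal sketch with toy stand-ins (the real Proposer is the frozen base LLM and the Verifier is flow-based; neither is shown here):

```python
import random

def best_of_n(propose, verify, prompt, n=16, rng=None):
    """ARTS-style decoupled search: a frozen Proposer samples candidate
    reasoning traces; a separately trained Verifier picks the winner.
    The base model is never finetuned, so rare traces keep their mass."""
    rng = rng or random.Random(0)
    candidates = [propose(prompt, rng) for _ in range(n)]
    return max(candidates, key=verify)

# Toy stand-ins (assumptions, not the paper's models): the proposer emits
# a rare-but-valid trace 10% of the time; the verifier recognizes it.
def toy_propose(prompt, rng):
    return "rare_valid" if rng.random() < 0.1 else "common_invalid"

def toy_verify(trace):
    return 1.0 if trace == "rare_valid" else 0.0

print(best_of_n(toy_propose, toy_verify, "count the arrangements...", n=64))
```

Under RLVR the 10% trace would be squeezed out of the policy itself; here it only has to appear once among the N samples for the verifier to rescue it.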
Evidence:
- Visualization of Normalization Squeeze: rare traces enter extinction zone under RLVR
- ARTS achieves 74.6% BoN@16 matching GRPO's 74.7% with frozen base model
- On extinction set (GRPO=0%), ARTS achieves 6.9% vs PRM's 1.7%
- ARTS 3.2x gain on rare solutions, 8x on Counting & Probability
Impact: Explains why RLVR struggles with hard problems: not capability but filtering. Provides practical alternative via inference-time search that preserves model diversity.
3. In-Context Function Learning in Large Language Models
Why Novel: First principled framework for evaluating ICL on continuous functions via Gaussian Process regression. Shows LLMs achieve GP-like learning curves and priors can be steered via post-training.
Key Innovations:
- Casts ICL as GP regression with known priors: GP as lower bound, 1-NN as upper bound
- LLM learning curves approach GP baseline, well below 1-NN, with log-scaling in model size
- Inductive bias analysis via likelihood under different kernels reveals bias toward rough functions
- Bias shifts toward smoother functions in higher dimensions
- SFT and GRPO steer priors toward training data structure; GRPO offers better generalization
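The GP-vs-1-NN baselines can be reproduced in a few lines. This sketch samples a target function from a known RBF prior (kernel and length-scale are illustrative choices, not the paper's setup) and shows the matched GP tracking the function far better than 1-NN as the in-context example count grows:

```python
import numpy as np

def rbf(X, Y, ls=0.5):
    """Squared-exponential kernel on 1-D inputs."""
    return np.exp(-(X[:, None] - Y[None, :]) ** 2 / (2 * ls ** 2))

def gp_predict(Xtr, ytr, Xte, ls=0.5, jitter=1e-6):
    """GP posterior mean under the (matched) RBF prior: the lower bound."""
    K = rbf(Xtr, Xtr, ls) + jitter * np.eye(len(Xtr))
    return rbf(Xte, Xtr, ls) @ np.linalg.solve(K, ytr)

def nn1_predict(Xtr, ytr, Xte):
    """1-nearest-neighbour baseline: the upper bound."""
    return ytr[np.abs(Xte[:, None] - Xtr[None, :]).argmin(axis=1)]

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 200)
f = rng.multivariate_normal(np.zeros(200), rbf(X, X) + 1e-6 * np.eye(200))
Xte, yte = X[150:], f[150:]          # held-out query points
for n in (5, 20, 80):                # growing in-context example count
    mse_gp = np.mean((gp_predict(X[:n], f[:n], Xte) - yte) ** 2)
    mse_nn = np.mean((nn1_predict(X[:n], f[:n], Xte) - yte) ** 2)
    print(n, f"GP={mse_gp:.4f}", f"1-NN={mse_nn:.4f}")
```

An LLM's ICL learning curve would be measured the same way, with the model's in-context predictions substituted for `gp_predict`.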
Evidence:
- Framework overview: ICL evaluation, post-training, and inductive bias analysis via GP likelihoods
- Learning curves: Qwen-3 approaches GP baseline, 14B/32B show log-scaling improvement
- Base models biased toward rough kernels; predictions more likely under Matérn-1/2
- Post-training shifts likelihood toward training data kernel
Impact: Provides quantitative framework for understanding and steering ICL priors. Enables principled design of post-training for data-efficient continuous-function tasks.
Trends
Verifier-free reasoning training maturing: NRT achieves competitive performance on unverifiable data
Long-tail reasoning preservation becoming critical: RLVR's normalization squeeze identified and addressed
In-context learning getting principled GP-based analysis with steerable priors
Stateful LLMs emerging: learned memory management replacing hand-crafted RAG pipelines
Cognitive architectures informing LLM agent design: ACT-R-grounded depth adaptation
Notable Papers (5)
1. Prototype Transformer: Towards Language Model Architectures Interpretable by Design
ProtoT replaces attention with prototype-based mixer achieving O(n) complexity. Prototypes learn nameable concepts, enable targeted edits, and support time-scale analysis for transparent reasoning.
2. The Pensieve Paradigm: Stateful Language Models Mastering Their Own Context
StateLM learns to manage internal context via Pensieve-inspired external memory with read/write/delete ops. +52% on BrowseComp-Plus, +10-20% on chat memory via learned state management.
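The memory interface the model is trained against can be sketched minimally; the key-value layout below is an illustrative assumption, not the paper's actual memory format:

```python
class PensieveMemory:
    """Toy external memory exposing the read/write/delete ops a StateLM
    learns to call instead of relying on a hand-crafted RAG pipeline."""
    def __init__(self):
        self.slots = {}

    def write(self, key, note):
        self.slots[key] = note      # offload a fact out of the context window

    def read(self, key):
        return self.slots.get(key)  # pull a fact back in when needed

    def delete(self, key):
        self.slots.pop(key, None)   # prune stale state

mem = PensieveMemory()
mem.write("user_prefers", "concise answers")
print(mem.read("user_prefers"))  # → concise answers
```

The paper's contribution is training the model to decide *when* to issue these ops, not the store itself.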
3. Think Fast and Slow: Step-Level Cognitive Depth Adaptation for LLM Agents
CogRouter implements ACT-R-grounded step-level depth adaptation with four cognitive levels. CoSFT + CoPO training achieves SOTA on ALFWorld/ScienceWorld with substantially lower token usage.
4. ThinkRouter: Efficient Reasoning via Routing Thinking between Latent and Discrete Spaces
Dynamic routing between latent and discrete reasoning based on a confidence threshold. This training-free method yields +19.7pp Pass@1 on average by routing to the discrete space when latent confidence is low.
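The routing rule itself is a one-liner; `tau=0.7` below is an assumed placeholder for the paper's threshold value:

```python
def route_step(latent_confidence, tau=0.7):
    """Confidence-gated routing (sketch): stay in the cheap latent space
    while the model is confident; fall back to explicit discrete-token
    reasoning when confidence drops below the threshold."""
    return "latent" if latent_confidence >= tau else "discrete"

print(route_step(0.92), route_step(0.35))  # → latent discrete
```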
5. Towards On-Policy SFT: Distribution Discriminant Theory and its Applications in LLM Training
Distribution Discriminant Theory (DDT) quantifies data-model alignment. IDFT reweights the SFT loss and Hinted Decoding realigns the data, matching DPO/SimPO performance at SFT efficiency.
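One plausible reading of IDFT's reweighting (up-weighting examples the current model already assigns high likelihood, i.e. closer to on-policy) can be sketched as follows; the functional form is an illustrative guess, not the paper's:

```python
import math

def idft_weights(nlls, alpha=1.0):
    """Assumed IDFT-style reweighting: low-NLL (near-on-policy) examples
    get larger weights; weights are normalized to mean 1. The exponential
    form and alpha are assumptions for illustration."""
    lik = [math.exp(-alpha * nll) for nll in nlls]
    z = sum(lik)
    return [l * len(nlls) / z for l in lik]

def reweighted_sft_loss(nlls):
    """Mean SFT loss with per-example alignment weights applied."""
    return sum(w * nll for w, nll in zip(idft_weights(nlls), nlls)) / len(nlls)
```

With `nlls=[0.1, 2.0]`, the near-on-policy example receives most of the weight, damping the gradient contribution of badly misaligned data.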
Honorable Mentions
- To Mix or To Merge: Toward Multi-Domain Reinforcement Learning for Large Language Models
- Tiny Recursive Reasoning with Mamba-2 Attention Hybrid
- SD-MoE: Spectral Decomposition for Effective Expert Specialization
- SpiralFormer: Looped Transformers Can Learn Hierarchical Dependencies via Multi-Resolution Recursion
- Stop Unnecessary Reflection: Training LRMs for Efficient Reasoning with Adaptive Reflection and Length Coordinated Penalty