
LLM Foundation Models: February 2026 Week 7

Feb 12 – Feb 18, 2026 · 164 papers analyzed · 3 breakthroughs

Summary

164 LLM papers analyzed. 3 breakthroughs: (1) 2602.11549 introduces Native Reasoning Training (NRT), achieving verifier-free reasoning via intrinsic rewards (+10% over SFT on an 8B model); (2) 2602.12846 identifies the 'Normalization Squeeze' by which RLVR extinguishes rare reasoning paths, and proposes ARTS with Flow Matching to recover long-tail solutions (74.6% BoN@16); (3) 2602.11863 formalizes in-context function learning as GP regression with priors steerable via post-training. Trends: verifier-free reasoning training maturing, long-tail reasoning preservation becoming critical, in-context learning getting principled GP-based analysis.

Key Takeaway

Week 7 addresses fundamental training limitations: verifier-free reasoning via intrinsic rewards, long-tail preservation against RLVR's filtering effect, and principled ICL analysis via GP regression with steerable priors.

Breakthroughs (3)

1. Native Reasoning Models: Training Language Models to Reason on Unverifiable Data

Why Novel: First comprehensive framework for training reasoning without external verifiers. Treats reasoning traces as latent variables with intrinsic rewards derived from predictive confidence.

Key Innovations:

  • Formalizes reasoning as latent-variable optimization: maximize P(y*|x,z), where z is the reasoning trace
  • Introduces aggregation-based reward schemes: Geometric Mean, Weighted Sum with inverse-probability weighting
  • NRT-WS(-log p) achieves 56.2% overall on Llama-3.1-8B vs 46.0% for SFT (+10.2pp)
  • Prevents mode collapse: maintains high entropy, long traces (415 tokens vs 12 for baselines)
  • Weighted schemes target model's uncertain tokens: largest gains on high-entropy predictions
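
The two aggregation schemes above can be sketched over the answer-token probabilities. This is an illustrative reconstruction, not the paper's implementation; function names and the exact normalization are assumptions:

```python
import math

def geometric_mean_reward(token_probs):
    # Geometric-mean aggregation: exp(mean log p) over answer tokens.
    return math.exp(sum(math.log(p) for p in token_probs) / len(token_probs))

def weighted_sum_reward(token_probs):
    # Weighted-sum aggregation with -log p weights: low-probability
    # (uncertain) tokens contribute more to the reward signal.
    weights = [-math.log(p) for p in token_probs]
    total = sum(weights)
    if total == 0.0:  # degenerate case: every token already at p = 1
        return 1.0
    return sum(w * p for w, p in zip(weights, token_probs)) / total

probs = [0.9, 0.5, 0.99, 0.2]
print(round(geometric_mean_reward(probs), 4))
print(round(weighted_sum_reward(probs), 4))
```

The -log p weighting is what would make NRT-WS concentrate reward on the tokens the model is least sure about, consistent with the observation that gains are largest on high-entropy predictions.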

Evidence:

  • Intrinsic reward schemes with aggregation functions and token reward signals
  • NRT-WS achieves 56.2% overall on Llama-3.1-8B across 9 benchmarks
  • Training curves show NRT maintains entropy/length/quality while RLPR collapses
  • Weighted schemes provide largest confidence gains on high-entropy tokens

Impact: Enables reasoning training on any SFT data without verifiers or expert traces. Opens path to scalable reasoning in unverifiable domains like creative writing, open-ended QA.

2. Amortized Reasoning Tree Search: Decoupling Proposal and Decision in Large Language Models

Why Novel: First formal analysis of RLVR's 'Normalization Squeeze' that extinguishes rare but valid reasoning paths. Proposes flow-based verifier to preserve long-tail reasoning without modifying base model.

Key Innovations:

  • Identifies 'Normalization Squeeze': RLVR acts as high-pass filter, suppressing rare valid traces exponentially
  • Tracks relative log-likelihood: rare traces decay systematically despite being correct
  • ARTS decouples Proposer (frozen base) from Verifier (flow-based) for inference-time search
  • Flow Matching preserves diversity where discriminative verifiers saturate
  • Matches GRPO finetuning (74.6% vs 74.7% BoN@16) with <0.3% trainable params
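
The proposer/verifier decoupling behind the inference-time search can be sketched as best-of-N selection. All names and the synthetic scoring below are illustrative stand-ins; ARTS's actual verifier is a Flow Matching model, not a lookup of ground-truth quality:

```python
import random

def propose(prompt, n, seed=0):
    # Stand-in for a frozen base model sampling n candidate reasoning
    # traces; each "trace" here is just (text, hidden quality score).
    rng = random.Random(seed)
    return [(f"trace-{i}", rng.random()) for i in range(n)]

def verifier_score(trace):
    # Stand-in for the flow-based verifier's scalar score. Here we
    # simply read the synthetic quality, for illustration only.
    _, quality = trace
    return quality

def best_of_n(prompt, n=16):
    # Decoupled search: the proposer suggests, the verifier decides.
    # The base model is never finetuned, so rare traces stay reachable.
    candidates = propose(prompt, n)
    return max(candidates, key=verifier_score)

best = best_of_n("What is 7 * 8?", n=16)
print(best[0])
```

Because the base model stays frozen, long-tail traces keep their sampling probability; the verifier only reranks them, which is how ARTS avoids the extinction effect of RLVR-style weight updates.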

Evidence:

  • Visualization of Normalization Squeeze: rare traces enter extinction zone under RLVR
  • ARTS achieves 74.6% BoN@16 matching GRPO's 74.7% with frozen base model
  • On extinction set (GRPO=0%), ARTS achieves 6.9% vs PRM's 1.7%
  • ARTS 3.2x gain on rare solutions, 8x on Counting & Probability

Impact: Explains why RLVR struggles with hard problems: not capability but filtering. Provides practical alternative via inference-time search that preserves model diversity.

3. In-Context Function Learning in Large Language Models

Why Novel: First principled framework for evaluating ICL on continuous functions via Gaussian Process regression. Shows that LLMs achieve GP-like learning curves and that priors can be steered via post-training.

Key Innovations:

  • Casts ICL as GP regression with known priors: GP as lower bound, 1-NN as upper bound
  • LLM learning curves approach GP baseline, well below 1-NN, with log-scaling in model size
  • Inductive bias analysis via likelihood under different kernels reveals bias toward rough functions
  • Bias shifts toward smoother functions in higher dimensions
  • SFT and GRPO steer priors toward training data structure; GRPO offers better generalization
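
The GP-vs-1-NN evaluation protocol can be reproduced in miniature with NumPy. The kernel, lengthscale, and target function below are illustrative choices, not the paper's setup:

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    # Squared-exponential kernel k(x, x') = exp(-(x - x')^2 / (2 l^2)).
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_predict(x_train, y_train, x_test, noise=1e-6):
    # GP posterior mean under a known RBF prior: the "oracle" baseline
    # that lower-bounds achievable error when the prior matches the data.
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_star = rbf_kernel(x_test, x_train)
    return k_star @ np.linalg.solve(K, y_train)

def nn1_predict(x_train, y_train, x_test):
    # 1-nearest-neighbour: copy the label of the closest context point;
    # a prior-free upper bound on error for smooth targets.
    idx = np.abs(x_test[:, None] - x_train[None, :]).argmin(axis=1)
    return y_train[idx]

# A smooth target; the training points play the role of ICL examples.
x_tr = np.linspace(-2, 2, 8)
y_tr = np.sin(x_tr)
x_te = np.linspace(-2, 2, 50)
gp_err = np.mean((gp_predict(x_tr, y_tr, x_te) - np.sin(x_te)) ** 2)
nn_err = np.mean((nn1_predict(x_tr, y_tr, x_te) - np.sin(x_te)) ** 2)
print(gp_err < nn_err)
```

An LLM's in-context learning curve landing between these two baselines is the paper's evidence that it behaves like a GP with some (steerable) implicit prior.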

Evidence:

  • Framework overview: ICL evaluation, post-training, and inductive bias analysis via GP likelihoods
  • Learning curves: Qwen-3 approaches GP baseline, 14B/32B show log-scaling improvement
  • Base models biased toward rough kernels (low ν); predictions more likely under Matérn-1/2
  • Post-training shifts likelihood toward training data kernel

Impact: Provides quantitative framework for understanding and steering ICL priors. Enables principled design of post-training for data-efficient continuous-function tasks.

Trends

  • Verifier-free reasoning training maturing: NRT achieves competitive performance on unverifiable data

  • Long-tail reasoning preservation becoming critical: RLVR's normalization squeeze identified and addressed

  • In-context learning getting principled GP-based analysis with steerable priors

  • Stateful LLMs emerging: learned memory management replacing hand-crafted RAG pipelines

  • Cognitive architectures informing LLM agent design: ACT-R-grounded depth adaptation

Notable Papers (5)

1. Prototype Transformer: Towards Language Model Architectures Interpretable by Design

ProtoT replaces attention with prototype-based mixer achieving O(n) complexity. Prototypes learn nameable concepts, enable targeted edits, and support time-scale analysis for transparent reasoning.

2. The Pensieve Paradigm: Stateful Language Models Mastering Their Own Context

StateLM learns to manage internal context via Pensieve-inspired external memory with read/write/delete ops. +52% on BrowseComp-Plus, +10-20% on chat memory via learned state management.
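
The read/write/delete interface can be sketched as a toy memory object. This is a stand-in for StateLM's learned memory management: string keys replace whatever learned addressing the model actually uses:

```python
class PensieveMemory:
    """Minimal sketch of an external memory exposing the three ops the
    paper describes (read/write/delete). Keys and values are plain
    strings here, not learned representations."""

    def __init__(self):
        self._slots = {}

    def write(self, key, value):
        # Persist a fact outside the context window.
        self._slots[key] = value

    def read(self, key, default=""):
        # Recall a fact back into working context.
        return self._slots.get(key, default)

    def delete(self, key):
        # Evict content the model decides it no longer needs.
        self._slots.pop(key, None)

mem = PensieveMemory()
mem.write("user_name", "Ada")
mem.write("scratch", "partial search results")
mem.delete("scratch")
print(mem.read("user_name"))  # "Ada"
print(mem.read("scratch"))    # "" (deleted)
```

The interesting part in the paper is not the data structure but that the model is trained to decide when to issue these ops, replacing hand-crafted RAG retrieval heuristics with learned state management.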

3. Think Fast and Slow: Step-Level Cognitive Depth Adaptation for LLM Agents

CogRouter implements ACT-R-grounded step-level depth adaptation with four cognitive levels. CoSFT + CoPO training achieves SOTA on ALFWorld/ScienceWorld with substantially lower token usage.

4. ThinkRouter: Efficient Reasoning via Routing Thinking between Latent and Discrete Spaces

Dynamic routing based on confidence p_t^max and a threshold τ. This training-free method yields +19.7pp Pass@1 on average by routing to discrete space when latent confidence is low.
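
The routing rule reduces to a per-step threshold test. This sketch assumes the step's latent token distribution is available; the threshold value and function name are illustrative:

```python
def route_step(latent_token_probs, tau=0.7):
    # Training-free routing rule: stay in latent space while the
    # model's max token probability p_t^max clears the threshold tau;
    # fall back to discrete (explicit-token) thinking otherwise.
    p_max = max(latent_token_probs)
    return "latent" if p_max >= tau else "discrete"

print(route_step([0.05, 0.9, 0.05]))  # confident step -> latent
print(route_step([0.3, 0.4, 0.3]))    # uncertain step -> discrete
```

Low p_t^max signals that the latent representation is ambiguous, so spelling the step out in discrete tokens lets the model resolve it explicitly.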

5. Towards On-Policy SFT: Distribution Discriminant Theory and its Applications in LLM Training

Distribution Discriminant Theory (DDT) quantifies data-model alignment. IDFT reweights SFT loss, Hinted Decoding realigns data, matching DPO/SimPO performance with SFT efficiency.

Honorable Mentions

  • To Mix or To Merge: Toward Multi-Domain Reinforcement Learning for Large Language Models
  • Tiny Recursive Reasoning with Mamba-2 Attention Hybrid
  • SD-MoE: Spectral Decomposition for Effective Expert Specialization
  • SpiralFormer: Looped Transformers Can Learn Hierarchical Dependencies via Multi-Resolution Recursion
  • Stop Unnecessary Reflection: Training LRMs for Efficient Reasoning with Adaptive Reflection and Length Coordinated Penalty