How LLMs Learn to Reason: A Complex Network Perspective

Sihan Hu, Xiansheng Cai, Yuan Huang, Zhiyuan Yao, Linfeng Zhang, Pan Zhang, Youjin Deng, Kun Chen

TL;DR

This work tackles why RL training with verifiable rewards exhibits rapid early gains, a two-stage learning trajectory, a V-shaped solution-length curve, and forgetting. It posits a unifying mechanism: a sparse concept web with average degree $\left<k\right>\approx 2$ emerging during RLVR, captured by the minimal Concept Network Model (CoNet). From this topology, it derives microscopic mechanisms (frustration-induced forgetting and phase-transition-like learning) and introduces Annealed-RLVR, a theory-guided intervention that improves performance on both in-distribution and out-of-distribution benchmarks. The framework offers a principled lens for understanding and engineering emergent reasoning in future AI systems, with implications for explainability and safety, and sets a path toward empirical mapping of the internal reasoning graph.

Abstract

Training large language models with Reinforcement Learning with Verifiable Rewards (RLVR) exhibits a set of distinctive and puzzling behaviors that remain poorly understood, including a two-stage learning curve, a V-shaped response-length trajectory, and a pronounced vulnerability to catastrophic forgetting. In this work, we propose that these behaviors are emergent collective phenomena governed not by neural implementation details, but by the topological evolution of the latent reasoning graph in semantic space. By demonstrating a dynamical isomorphism between a 1.5B-parameter LLM and a minimal Concept Network Model (CoNet), we trace the causal source to the self-organization of a sparse concept web pinned to an average degree of two. This geometric perspective provides a unified physical explanation for the observed anomalies: the V-shaped trajectory tracks the evolution from parallel local skill optimization to global network integration; catastrophic forgetting stems from the topological disconnection of critical "trunk" edges; and policy collapse arises from the accumulation of sequential transitions at the web's leaf nodes, where broad exploration abruptly freezes into rigid, high-reward trajectories. Identifying a "maximally frustrated state" at the transition between learning stages, we propose Annealed-RLVR, a principled algorithm that injects a targeted SFT "heating" step to resolve this topological bottleneck. Experiments confirm that this theory-driven intervention outperforms standard RLVR on both in-distribution and out-of-distribution benchmarks (including Minerva and AIME). By recasting RLVR from black-box optimization into a predictable process of structural self-organization, our work provides a new physical intuition for engineering the emergent reasoning capabilities of future AI systems.
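
As a reading aid (an interpretation, not a derivation taken from the paper), the "average degree of two" can be related to tree-like connectivity through a standard graph identity. The note assumes degree is counted on the undirected version of the concept web; the symbols $N$ (number of nodes) and $E$ (number of edges) are introduced here for illustration only.

```latex
% Handshake identity: every edge contributes to the degree of two nodes.
\left<k\right> \;=\; \frac{1}{N}\sum_{i=1}^{N} k_i \;=\; \frac{2E}{N}

% A connected graph without cycles (a tree) has E = N - 1 edges, hence
\left<k\right>_{\text{tree}} \;=\; \frac{2(N-1)}{N} \;=\; 2 - \frac{2}{N}
\;\xrightarrow{\;N\to\infty\;}\; 2
```

Under this assumption, a large connected web with $\left<k\right>\approx 2$ has very few redundant edges, so removing a single edge can split it. That is consistent with the abstract's language of critical "trunk" edges and leaf nodes.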

Paper Structure

This paper contains 26 sections, 1 equation, 13 figures, 1 table.

Figures (13)

  • Figure 1: CoNet Reproduces Core LLM Training Dynamics. The minimal CoNet model (a) reproduces the two core empirical signatures of RLVR training observed in the DeepScaleR-1.5B LLM (b). These signatures are: (i) a two-stage reward dynamic consisting of a fast-learning stage followed by a slow-learning phase (top panels), and (ii) a non-monotonic, V-shaped evolution of the correct response length (bottom panels). This striking correspondence validates CoNet as a valuable minimal model for our theoretical analysis. The inset in (a) schematically depicts the CoNet abstraction. Lines in (b) are smoothed with a 60-step moving average.
  • Figure 2: Visualizing the Structural Evolution from "Skill Islands" to the Concept Web. These network snapshots provide a direct, qualitative narrative of the concept web's formation, corresponding to the different phases of training. In each subfigure, green, red, and blue dots indicate the question, answer, and intermediate nodes of the CoNet, respectively. The color of each directed edge represents the transition probability from the head to the tail and is drawn according to the colorbar in the rightmost column. Here, only edges with transition probabilities greater than $0.95$ are shown. (a) In the early fast-learning phase (Step 20), the model discovers a few short, disjointed reasoning paths, representing the first nascent "skill islands". (b) At the onset of slow learning (Step 50), the system has proliferated into a maximal collection of disconnected islands. (c) Deep in the slow-learning phase (Step 800), these previously separate islands have coalesced into a single, giant connected component, forming the unified and expansive concept web.
  • Figure 3: A Sparse-Web Structure Necessitates Longer Reasoning Chains. This figure links the emergent sparse topology of the concept web to the observed increase in response length. (b, c) The color and marker scheme is identical to that used in Fig. 2, and only edges with transition probability greater than $0.95$ are retained. The largest connected component (a local region of which is shown here) remains sparse with average degree around $2$ even as it grows from step $50$ to $800$. (a) Consequently, the distribution of solution path lengths, shown in both raw counts (histogram) and probability density (inset), shifts decisively to the right during this period, confirming that navigating the sparse backbone requires longer reasoning chains.
  • Figure 4: The Fragility of a Sparse Web: Catastrophic Forgetting and Fast Recovery. This figure demonstrates a key prediction of our sparse-web hypothesis: that the concept web is fragile, relying on critical bridge-like connections. (c-e) The color and shape conventions follow those of Fig. 2, and only a local view of the concept web is shown. Orange crosses on the edges indicate decreases in transition weights, with thicker crosses denoting more substantial decreases. The microscopic view in CoNet shows how a lightweight supervised fine-tuning (SFT) on a converged web (c) severs these critical bridges, causing the structure to fragment (d). Subsequent RLVR rapidly repairs these links (e). (a, b) This microscopic severing manifests as macroscopic catastrophic forgetting in both CoNet and the 1.5B LLM, where performance plummets upon initiating SFT. The subsequent fast recovery once RLVR resumes highlights the localized nature of the damage, underscoring the web's structural fragility.
  • Figure 5: Microscopic Mechanisms of RLVR: Forgetting by Frustration, Learning by Phase Transition. The learning trajectories for individual problems in CoNet (a) and the 1.5B LLM (b) reveal a fundamental duality. (i) Frustration-Induced Forgetting: At the onset of slow learning [see inset in (a)], the intense competition for connections on a sparse web manifests as volatile, non-monotonic accuracy curves, where some skills are competitively suppressed. (ii) Phase-Transition-Like Learning: Subsequently, the web's sparse frontier enables new skills to be acquired in punctuated, accelerated jumps (orange and green). The smoother gradients in the LLM (vs. the clean steps in CoNet) can be attributed to finite-size effects, as explained in the main text.
  • ...and 8 more figures
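
The captions of Figures 2 and 3 describe a three-step measurement: keep only edges whose transition probability exceeds $0.95$, find the connected components, and report the average degree of the largest one. The sketch below illustrates that procedure on made-up data. It is not the authors' code: the function names, the toy `transitions` dictionary, and the choice to ignore edge direction when counting degree and connectivity are all assumptions made for this example.

```python
from collections import defaultdict


def threshold_edges(transitions, p_min=0.95):
    """Keep directed edges whose transition probability exceeds p_min."""
    return [(u, v) for (u, v), p in transitions.items() if p > p_min]


def connected_components(edges):
    """Weakly connected components (edge direction is ignored).

    Nodes with no retained edge do not appear in any component.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:  # iterative depth-first search
            node = stack.pop()
            comp.add(node)
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        components.append(comp)
    return components


def average_degree(component, edges):
    """Mean undirected degree 2E/N of the subgraph induced by `component`."""
    undirected = {
        frozenset((u, v))
        for u, v in edges
        if u in component and v in component and u != v
    }
    return 2 * len(undirected) / len(component)


# Hypothetical toy data: q* = question, c* = intermediate concept, a* = answer.
transitions = {
    ("q1", "c1"): 0.99,
    ("c1", "c2"): 0.97,
    ("c2", "a1"): 0.98,
    ("c2", "c3"): 0.96,
    ("q2", "c4"): 0.99,
    ("c4", "a2"): 0.40,  # below threshold, dropped
    ("c3", "c4"): 0.10,  # below threshold, so the two islands stay separate
}

edges = threshold_edges(transitions)
components = connected_components(edges)
largest = max(components, key=len)
avg_k = average_degree(largest, edges)
print(len(components), sorted(largest), avg_k)
```

In this toy graph the largest component is a five-node tree, so its average degree is $2 \cdot 4 / 5 = 1.6$, below two. For a tree the value approaches two as the component grows, which is one way a sparse, bridge-dependent web can sit near $\left<k\right>\approx 2$.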