Is my model "mind blurting"? Interpreting the dynamics of reasoning tokens with Recurrence Quantification Analysis (RQA)

Quoc Tuan Pham, Mehdi Jafari, Flora Salim

TL;DR

This work addresses the challenge of diagnosing long reasoning traces in large language models by moving beyond text-based proxies such as response length. It introduces Recurrence Quantification Analysis (RQA), which treats token generation as a dynamical system: latent trajectories are extracted from final-layer embeddings, and metrics such as DET (determinism), LAM (laminarity), and ENTR (entropy) characterize predictability, stalling, and complexity, respectively. Temporal RQA, which applies sliding-window analysis, outperforms length-based baselines in predicting task complexity on the ZebraLogic benchmark, achieving an absolute improvement of $8\%$ on a set of $3{,}600$ traces, while response length remains informative for predicting binary failure. The paper demonstrates the potential of non-textual, structure-sensitive diagnostics to monitor and possibly control test-time reasoning dynamics, and outlines limitations and future work on parameter sensitivity and generalization across architectures and domains.
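For orientation, the metrics named above have standard textbook definitions in the RQA literature, sketched below. The symbols ($x_i$ for the latent state at step $i$, $d$ for the distance, $\epsilon$ for the threshold, $l_{\min}$ and $v_{\min}$ for minimum line lengths) are assumptions made here for illustration; the paper's own notation and parameter choices may differ.

```latex
% Recurrence matrix: two steps recur when their latent states are within epsilon.
R_{i,j} = \Theta\bigl(\epsilon - d(x_i, x_j)\bigr)

% P(l): number of diagonal lines of length l; P(v): number of vertical lines of length v.
\mathrm{DET} = \frac{\sum_{l \ge l_{\min}} l \, P(l)}{\sum_{l \ge 1} l \, P(l)}
\qquad
\mathrm{LAM} = \frac{\sum_{v \ge v_{\min}} v \, P(v)}{\sum_{v \ge 1} v \, P(v)}

% ENTR: Shannon entropy of the diagonal line-length distribution.
\mathrm{ENTR} = -\sum_{l \ge l_{\min}} p(l) \ln p(l),
\qquad
p(l) = \frac{P(l)}{\sum_{l' \ge l_{\min}} P(l')}
```

Read this way, DET is high when recurrent points line up along diagonals (the trajectory retraces earlier paths), while LAM is high when they line up vertically (the trajectory lingers near one state).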

Abstract

Test-time compute is central to large reasoning models, yet analysing their reasoning behaviour through generated text is increasingly impractical and unreliable. Response length is often used as a crude proxy for reasoning effort, but this metric fails to capture the dynamics and effectiveness of the Chain of Thought (CoT) or the generated tokens. We propose Recurrence Quantification Analysis (RQA) as a non-textual alternative for analysing a model's reasoning chains at test time. By treating token generation as a dynamical system, we extract hidden embeddings at each generation step and apply RQA to the resulting trajectories. RQA metrics, including Determinism and Laminarity, quantify patterns of repetition and stalling in the model's latent representations. Analysing 3,600 generation traces from DeepSeek-R1-Distill, we show that RQA not only captures signals not reflected by response length, but also substantially improves prediction of task complexity by 8%. These results help establish RQA as a principled tool for studying the latent token generation dynamics of test-time scaling in reasoning models.
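The step from hidden embeddings to Determinism and Laminarity can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the cosine-distance threshold `eps=0.1`, and the minimum line lengths `l_min` and `v_min` are assumptions chosen for illustration, and `states` stands in for a `(steps, hidden_dim)` array of non-zero hidden states collected during generation.

```python
import numpy as np

def recurrence_matrix(states, eps=0.1):
    """R[i, j] = 1 when the cosine distance between steps i and j is below eps."""
    unit = states / np.linalg.norm(states, axis=1, keepdims=True)
    cos_dist = 1.0 - unit @ unit.T
    return (cos_dist < eps).astype(int)

def run_lengths(bits):
    """Lengths of the consecutive runs of 1s in a 1-D 0/1 array."""
    padded = np.concatenate(([0], bits, [0]))
    change = np.diff(padded)
    starts = np.flatnonzero(change == 1)
    ends = np.flatnonzero(change == -1)
    return ends - starts

def line_ratio(lengths, min_len):
    """Share of recurrent points that sit on lines of at least min_len points."""
    total = lengths.sum()
    return float(lengths[lengths >= min_len].sum() / total) if total else 0.0

def determinism(R, l_min=2):
    """DET from diagonal lines above the main diagonal (R is symmetric)."""
    runs = [run_lengths(np.diagonal(R, k)) for k in range(1, R.shape[0])]
    return line_ratio(np.concatenate(runs), l_min) if runs else 0.0

def laminarity(R, v_min=2):
    """LAM from vertical lines; the main diagonal is kept for simplicity."""
    runs = [run_lengths(R[:, j]) for j in range(R.shape[1])]
    return line_ratio(np.concatenate(runs), v_min)

# A trajectory that never leaves one state: maximal stalling.
stalled = np.ones((5, 3))
R = recurrence_matrix(stalled)
print(determinism(R), laminarity(R))
```

In this toy case every pair of steps recurs, so laminarity is 1.0 and determinism falls just short of 1 only because the outermost diagonal holds a single point. In practice `states` would come from the model itself, for example the last-layer hidden state of each generated token.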

Paper Structure (32 sections, 7 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Problems with CoT Analysis - Top (The Challenges): (1) Massive token counts prevent human audit; (2) Total length is a poor proxy for reasoning quality; (3) Textual output often masks internal logic failures; (4) Static interpretability fails to see time-dependent patterns. Bottom (RQA: the proposed solution): We propose RQA to transform discrete tokens into a measurable latent trajectory, decoding the dynamics of reasoning. We demonstrate that these temporal signals significantly outperform response length in resolving task complexity and identifying structural signals of the dynamics.
  • Figure 2: The proposed RQA interpretability pipeline. (A) Tokens are generated autoregressively. (B) Latent states form a high-dimensional trajectory. (C) Self-similarity is mapped to a recurrence matrix. (D) Non-stationary dynamics are quantified via sliding windows. (E) Temporal features (slopes, DFA) serve as inputs for downstream classification.
  • Figure 3: Hidden-state trajectory visualisation. Cosine similarity and recurrence plot ($\epsilon=0.1$) illustrating DET and LAM structures.
  • Figure 4: Preliminary evaluation showing the inverse correlation between combinatorial complexity and puzzle-level accuracy across DeepSeek distilled models.
  • Figure 5: Confusion matrices for complexity classification. (a) Response-length baseline, which performs well on extreme complexity levels but struggles in intermediate regimes. (b) Temporal RQA, which exhibits more uniform performance across complexity levels.
  • ...and 1 more figure
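Steps (D) and (E) of the pipeline in Figure 2 can be sketched as follows, under loud assumptions: the window size, step, threshold `eps`, and the choice of recurrence rate as the per-window metric are all illustrative and not taken from the paper, and only a linear slope is computed (DFA is omitted). The function names are hypothetical.

```python
import numpy as np

def windowed_recurrence_rate(states, window=32, step=16, eps=0.1):
    """Recurrence rate inside each sliding window over the trajectory."""
    unit = states / np.linalg.norm(states, axis=1, keepdims=True)
    rates = []
    for start in range(0, len(unit) - window + 1, step):
        chunk = unit[start:start + window]
        R = (1.0 - chunk @ chunk.T) < eps
        # Drop the trivial self-matches on the main diagonal.
        off_diag = R.sum() - np.trace(R)
        rates.append(off_diag / (window * (window - 1)))
    return np.array(rates)

def trend_slope(series):
    """Least-squares slope of a metric across windows (0.0 if too short)."""
    if len(series) < 2:
        return 0.0
    t = np.arange(len(series))
    return float(np.polyfit(t, series, 1)[0])

# Toy trajectory: eight mutually orthogonal states, then eight identical ones.
wander = np.eye(8)
stall = np.tile(np.eye(8)[0], (8, 1))
states = np.vstack([wander, stall])

rates = windowed_recurrence_rate(states, window=4, step=4)
print(rates, trend_slope(rates))
```

A positive slope here reflects a trajectory that begins by exploring and ends by stalling, which is the kind of non-stationary behaviour a single whole-trace metric would average away. Slopes of this kind, computed per metric, could then serve as features for a downstream classifier.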