Is my model "mind blurting"? Interpreting the dynamics of reasoning tokens with Recurrence Quantification Analysis (RQA)
Quoc Tuan Pham, Mehdi Jafari, Flora Salim
TL;DR
This work addresses the challenge of diagnosing extensive reasoning traces in large language models by moving beyond text-based proxies such as response length. It introduces Recurrence Quantification Analysis (RQA), which treats token generation as a dynamical system: latent trajectories are extracted from final-layer embeddings, and metrics such as Determinism (DET), Laminarity (LAM), and entropy (ENTR) are computed to characterize predictability, stalling, and complexity. Temporal RQA, which applies the analysis over sliding windows, outperforms length-based baselines in predicting task complexity on the ZebraLogic benchmark, with an absolute improvement of $8\%$ on a set of $3{,}600$ traces, while length remains informative for predicting binary failure. The paper demonstrates the potential of non-textual, structure-sensitive diagnostics to monitor and possibly control test-time reasoning dynamics, and it outlines limitations and future work on parameter sensitivity and on generalization across architectures and domains.
Abstract
Test-time compute is central to large reasoning models, yet analysing their reasoning behaviour through generated text is increasingly impractical and unreliable. Response length is often used as a crude proxy for reasoning effort, but this metric fails to capture the dynamics and effectiveness of the Chain of Thought (CoT) or the generated tokens. We propose Recurrence Quantification Analysis (RQA) as a non-textual alternative for analysing a model's reasoning chains at test time. By treating token generation as a dynamical system, we extract hidden embeddings at each generation step and apply RQA to the resulting trajectories. RQA metrics, including Determinism and Laminarity, quantify patterns of repetition and stalling in the model's latent representations. Analysing 3,600 generation traces from DeepSeek-R1-Distill, we show that RQA not only captures signals not reflected by response length but also substantially improves prediction of task complexity, by 8\%. These results help establish RQA as a principled tool for studying the latent token generation dynamics of test-time scaling in reasoning models.
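To make the pipeline described above concrete, the following is a minimal sketch of how RQA metrics could be computed from a trajectory of per-step embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names (`recurrence_matrix`, `rqa_metrics`, `windowed_rqa`), the Euclidean distance with a fixed threshold `eps`, the minimum line lengths `l_min` and `v_min`, and the window parameters are all hypothetical choices, and the trajectory is assumed to be an array of shape (steps, dimensions) with at least two steps.

```python
import numpy as np


def recurrence_matrix(traj, eps):
    """Binary recurrence matrix: R[i, j] = 1 when states i and j lie within eps.

    Assumption: Euclidean distance and a fixed threshold; the source does not
    specify either choice.
    """
    diff = traj[:, None, :] - traj[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    return (dist <= eps).astype(np.uint8)


def run_lengths(binary_1d):
    """Lengths of the maximal runs of ones in a 1-D 0/1 array."""
    padded = np.concatenate(([0], binary_1d, [0])).astype(np.int8)
    change = np.diff(padded)
    starts = np.flatnonzero(change == 1)
    ends = np.flatnonzero(change == -1)
    return ends - starts


def rqa_metrics(R, l_min=2, v_min=2):
    """Compute DET, LAM, and ENTR from a square recurrence matrix (n >= 2)."""
    n = R.shape[0]
    # Diagonal lines above the main diagonal; R is symmetric, so the lower
    # half would only duplicate them. The main diagonal itself is excluded.
    diag = np.concatenate(
        [run_lengths(np.diagonal(R, offset=k)) for k in range(1, n)]
    )
    # Vertical lines, taken over full columns (a simplification: the main
    # diagonal is not masked out here).
    vert = np.concatenate([run_lengths(R[:, j]) for j in range(n)])

    long_diag = diag[diag >= l_min]
    long_vert = vert[vert >= v_min]

    # DET: share of recurrence points that sit on diagonal lines >= l_min.
    det = long_diag.sum() / diag.sum() if diag.sum() > 0 else 0.0
    # LAM: share of recurrence points that sit on vertical lines >= v_min.
    lam = long_vert.sum() / vert.sum() if vert.sum() > 0 else 0.0

    # ENTR: Shannon entropy of the diagonal line-length distribution.
    if long_diag.size:
        _, counts = np.unique(long_diag, return_counts=True)
        p = counts / counts.sum()
        entr = float(-(p * np.log(p)).sum())
    else:
        entr = 0.0
    return {"DET": float(det), "LAM": float(lam), "ENTR": entr}


def windowed_rqa(traj, eps, window, step):
    """Sliding-window variant: one metrics dict per window of the trajectory."""
    return [
        rqa_metrics(recurrence_matrix(traj[s:s + window], eps))
        for s in range(0, len(traj) - window + 1, step)
    ]


# Synthetic stand-ins for hidden states (not real model embeddings):
# a trajectory cycling through three states, and one that never moves.
periodic = np.tile(np.eye(3), (4, 1))
stalled = np.zeros((10, 4))
periodic_metrics = rqa_metrics(recurrence_matrix(periodic, eps=0.5))
stalled_metrics = rqa_metrics(recurrence_matrix(stalled, eps=0.1))
```

On these toy inputs the cycling trajectory yields high DET with zero LAM (repetition without stalling), whereas the stationary trajectory yields LAM of 1 (pure stalling). In practice, `traj` would hold the per-step final-layer embeddings, and `eps` would need tuning, which is the kind of parameter sensitivity the summary flags as future work.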
