Quantitative Introspection in Language Models: Tracking Internal States Across Conversation

Nicolas Martorell

Abstract

Tracking the internal states of large language models across conversations is important for safety, interpretability, and model welfare, yet current methods are limited. Linear probes and other white-box methods compress high-dimensional representations imperfectly and become harder to apply as model size increases. Taking inspiration from human psychology, where numeric self-report is a widely used tool for tracking internal states, we ask whether LLMs' own numeric self-reports can track probe-defined emotive states over time. We study four concept pairs (wellbeing, interest, focus, and impulsivity) in 40 ten-turn conversations, operationalizing introspection as the causal informational coupling between a model's self-report and a concept-matched probe-defined internal state. We find that greedy-decoded self-reports collapse to a few uninformative values, but introspective capacity can be unmasked by calculating logit-based self-reports. This metric tracks interpretable internal states (Spearman $\rho = 0.40$--$0.76$; isotonic $R^2 = 0.12$--$0.54$ in LLaMA-3.2-3B-Instruct) and follows how those states change over time, and activation steering confirms that the coupling is causal. Furthermore, we find that introspection is present at turn 1 but evolves through conversation, and can be selectively improved by steering along one concept to boost introspection for another ($\Delta R^2$ up to $0.30$). Crucially, these phenomena scale with model size in some cases, with $R^2 \approx 0.93$ in LLaMA-3.1-8B-Instruct, and partially replicate in other model families. Together, these results position numeric self-report as a viable, complementary tool for tracking internal emotive states in conversational AI systems.
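
The contrast between greedy and logit-based self-reports can be illustrated with a minimal sketch. It follows the definition given in the Figure 2 caption (a probability-weighted average over the digit-token logits). The 0-9 rating scale, the function names, and the example logit values are assumptions made for illustration, not details taken from the paper.

```python
import math


def digit_probabilities(digit_logits):
    """Softmax restricted to the digit tokens.

    digit_logits maps each integer rating to the model's logit for that
    digit token. The 0-9 scale used below is an assumption.
    """
    peak = max(digit_logits.values())  # subtract the max for numerical stability
    weights = {r: math.exp(v - peak) for r, v in digit_logits.items()}
    total = sum(weights.values())
    return {r: w / total for r, w in weights.items()}


def logit_self_report(digit_logits):
    """Probability-weighted average rating (a continuous value)."""
    probs = digit_probabilities(digit_logits)
    return sum(r * p for r, p in probs.items())


def greedy_self_report(digit_logits):
    """Rating whose digit token has the highest logit (an integer)."""
    return max(digit_logits, key=digit_logits.get)


# Two hypothetical logit vectors that share the same top digit.
peaked = {r: 0.0 for r in range(10)}
peaked[5] = 2.0
skewed = dict(peaked)
skewed[8] = 1.5  # extra mass on a higher rating, still below the top digit

for name, logits in (("peaked", peaked), ("skewed", skewed)):
    print(name, greedy_self_report(logits), round(logit_self_report(logits), 3))
```

Both hypothetical inputs yield the same greedy rating, while the weighted average differs between them. This is the kind of variation that a collapsed integer output hides and a logit-based readout retains.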

Paper Structure (54 sections, 4 equations, 9 figures)

Figures (9)

  • Figure 1: Linear probes recover four interpretable internal directions in LLaMA-3.2-3B-Instruct. Panels A, C, E, and G show layer-wise Cohen's $d$ sweeps for the sad-vs-happy (wellbeing), bored-vs-interested (interest), distracted-vs-focused (focus), and planning-vs-impulsive (impulsivity) probes; dashed lines mark the selected layers and the gray band marks the searched layer range. The best layers are 16, 14, 10, and 13, with peak $d$ values 3.34, 1.67, 1.99, and 3.60. For wellbeing and impulsivity, scores were sign-corrected so that larger values align with larger self-reported values in later experiments. Panels B, D, F, and H show the selected-layer score distributions on held-out evaluation texts; boxplots show text-level scores. The two poles separate in all four cases (Welch's $t$-test: wellbeing, $d = 3.34$, $p = 7.21 \times 10^{-13}$; interest, $d = 1.67$, $p = 9.45 \times 10^{-6}$; focus, $d = 1.99$, $p = 5.73 \times 10^{-7}$; impulsivity, $d = 3.60$, $p = 3.58 \times 10^{-13}$), and all four survive BH correction across concepts.
  • Figure 2: Internal-state drift is tracked by numeric self-reports of the same concept. All panels use 40 ten-turn conversations; shaded bands denote cluster-bootstrap 95% CIs across conversations. Panel A shows greedy integer self-reports across turns, with thin lines for individual conversations and thick lines for per-turn means. Greedy ratings are largely collapsed, with clear positive drift only for interest (mixed-effects turn slope $= 0.14$, $p = 4.98 \times 10^{-42}$) and smaller positive trends for wellbeing and focus (slopes $= 0.029$ and $0.022$, $p = 8.38 \times 10^{-4}$ and $3.37 \times 10^{-6}$); impulsivity does not drift reliably (slope $= 0.002$, $p = 0.20$). Panel B shows the average number of distinct greedy responses used for each concept across the 40 conversations, averaging across turns. Panel C shows probe scores from the trained concept directions across turns. Probe scores drift strongly for interest and focus (slopes $= 0.005$ and $0.002$, $p = 4.12 \times 10^{-14}$ and $1.75 \times 10^{-10}$), are flat for wellbeing (slope $= -3.6 \times 10^{-4}$, $p = 0.64$), and show only a weak effect for impulsivity (LMM slope $= 0.001$, $p = 0.002$). Panel D shows logit-based self-reports, defined as the probability-weighted average over the digit-token logits. Unlike greedy or sampled outputs, the logit-based measure shows robust drift in all four concepts, with positive slopes for wellbeing, interest, and focus and a negative slope for impulsivity (all $p < 10^{-6}$). Panel E shows Shannon entropy of the self-report distribution. The logit-based method is most informative in all four concepts. Within each four-concept family in panels A, C, and D, BH correction leaves the significance pattern unchanged.
  • Figure 3: Self-reports track the probe-defined internal state from the first turn, and direct self-steering shifts self-reports causally in the predicted direction. Panel A shows probe score versus logit-based self-report, with one point per conversation-turn observation and black isotonic fits. Descriptive associations are positive for all four concepts (pooled $\rho = 0.40$--$0.76$; isotonic $R^2 = 0.12$--$0.54$), and mixed-effects probe slopes are positive in all cases (all $p < 10^{-5}$). Panel B shows turn-wise Spearman $\rho$ (introspective strength) for the four concepts; panel C shows turn-wise isotonic $R^2$ (introspective fidelity). Shaded bands denote bootstrap 95% CIs. Introspection is already present at turn 1 and remains positive through turn 10, although its trajectory is concept-dependent. Panel D shows mean logit-based self-report versus steering alpha; shaded bands denote cluster-bootstrap 95% CIs. Steering shifts self-report monotonically in the expected direction for all four concepts (mixed-effects alpha slopes $0.067$--$0.40$, all $p < 10^{-12}$). Panel E shows drift magnitude (last minus first turn) versus steering alpha, with cluster-bootstrap 95% CIs. Wellbeing drift becomes slightly more positive with alpha (LMM slope $= 0.012$, $p = 4.30 \times 10^{-6}$), impulsivity drift becomes more negative with alpha (LMM slope $= -0.028$, $p = 8.92 \times 10^{-53}$), and interest and focus drift decrease with alpha (fallback per-conversation slope means $= -0.095$ and $-0.11$; $p = 7.45 \times 10^{-14}$ and $1.57 \times 10^{-14}$, because the corresponding LMMs are singular). Across the four-concept families in panels A, D, and E, all reported effects survive BH correction across concepts.
  • Figure 4: Steering one concept can selectively improve introspection for another. Panel A shows the maximum increase in isotonic $R^2$ relative to $\alpha = 0$ for each steering/measured pair; red boxes mark cells with nominally significant cluster-bootstrap improvements. Only focus$\to$wellbeing and impulsivity$\to$interest meet this criterion ($\Delta R^2 = 0.30$, $p = 9.99 \times 10^{-4}$; $\Delta R^2 = 0.098$, $p = 0.012$). Under BH correction across the 12 tested non-null cells, focus$\to$wellbeing remains significant ($q \approx 0.011$) while impulsivity$\to$interest is marginal ($q \approx 0.066$). Panels B and C show isotonic $R^2$ and Spearman $\rho$ versus steering alpha for those two cells; shaded bands denote bootstrap 95% CIs. For impulsivity$\to$interest, the mixed-effects alpha slopes are positive for both metrics ($0.018$ and $0.020$; $p = 2.95 \times 10^{-10}$ and $1.60 \times 10^{-12}$). For focus$\to$wellbeing, the corresponding LMMs are singular, but fallback per-conversation alpha-correlation tests remain positive (mean correlations $= 0.46$ and $0.51$; $p = 1.57 \times 10^{-5}$ and $1.20 \times 10^{-6}$). All four panels B--C trend tests survive BH correction. Panels D and E show the alpha-extreme scatter plots for the same two conditions. In focus$\to$wellbeing, correlation increases from $\rho = 0.42$, $R^2 = 0.34$ at $\alpha = -4$ to $\rho = 0.85$, $R^2 = 0.75$ at $\alpha = +4$; in impulsivity$\to$interest, it increases from $\rho = 0.70$, $R^2 = 0.46$ to $\rho = 0.83$, $R^2 = 0.69$. Panels F and G show Shannon entropy of the previous-turn probe scores and logit-based self-reports, respectively, as functions of alpha. For focus$\to$wellbeing, both probe entropy and report entropy increase monotonically with alpha (from 1.09 to 1.67 bits and from 0.88 to 1.69 bits; LMMs are singular, fallback one-sample $t$-tests on per-conversation entropy slopes: mean slopes $= 0.071$ and $0.097$, $p = 6.16 \times 10^{-8}$ and $7.15 \times 10^{-10}$). 
    For impulsivity$\to$interest, only probe entropy shows a robust increase (LMM slope $= 0.024$, $p = 2.30 \times 10^{-4}$), whereas report entropy does not ($p = 0.11$). Within each two-test entropy family, all significant effects survive BH correction.
  • Figure 5: Results generalize unevenly across model scales and families. Panel A shows isotonic $R^2$ versus model size; panel B shows Spearman $\rho$ versus model size. Introspection increases strongly with size for wellbeing and interest, but remains weak for focus and impulsivity. Panel C shows probe score versus logit self-report for the two strongest LLaMA 8B probes, wellbeing and interest, with black isotonic fits. Both show near-ceiling introspection ($\rho = 0.93$ and $0.96$; isotonic $R^2 = 0.90$ and $0.93$), and mixed-effects probe slopes are strongly positive in both cases ($p < 10^{-10}$). Panel D shows the pooled Spearman $\rho(\alpha,\text{self-report})$ from the self-steering analysis across concept and LLaMA size, as a descriptive sign-validation heatmap. Panel E shows mean isotonic $R^2$ across the subset of concept-model pairs that pass the steering-sign validation filter, with bootstrap 95% CIs. Mean validated $R^2$ increases from 1B to 3B to 8B (0.12, 0.37, 0.61); a pooled LMM over validated cells with probe-$z \times \log(\text{size})$, concept intercepts and concept-specific probe slopes, and random intercept by conversation is strongly positive ($\beta = 0.29$, $p = 5.55 \times 10^{-99}$). Panels F and G show probe-score drift across turns for the wellbeing direction and the corresponding mean first-to-last probe drift by size; probe-drift magnitude increases with scale (bootstrap slope versus $\log(\text{size}) = 0.041$, $p < 2 \times 10^{-4}$). Panels H and I show the analogous logit self-report drift analyses: mixed-effects turn slopes are positive for all three sizes (1B: $0.159$, $p = 1.13 \times 10^{-56}$; 3B: $0.038$, $p = 4.01 \times 10^{-30}$; 8B: $0.141$, $p = 8.85 \times 10^{-128}$), but unlike probe drift, report-drift magnitude does not increase with scale, and the size-slope is negative (mean slope $= -0.23$, $p = 0.023$). 
    Panel J shows logit self-report drift through turns for wellbeing in Gemma and Qwen; mixed-effects turn slopes are positive in both families (0.056 and 0.026, both $p < 10^{-3}$). Panels K and L show probe score versus logit self-report for Gemma and Qwen, respectively. Qwen shows stronger introspection than Gemma ($\rho = 0.49$ vs. $0.28$; isotonic $R^2 = 0.76$ vs. $0.11$), and mixed-effects probe slopes are positive in both cases ($p = 1.19 \times 10^{-84}$ and $1.33 \times 10^{-13}$, respectively). Panels M and N show turn-wise isotonic $R^2$ and turn-wise Spearman $\rho$ for the same cross-family comparison; Qwen shows a strong first-to-last decline in turn-wise isotonic $R^2$ ($\Delta = -0.44$, cluster-bootstrap $p = 0.001$), whereas the corresponding first-to-last $\rho$ changes are not significant in either family.
  • ...and 4 more figures
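
The captions above report two association measures between probe scores and logit-based self-reports: Spearman $\rho$ (introspective strength) and isotonic $R^2$ (introspective fidelity). The sketch below shows one common way to compute both using only the standard library. It assumes the probe score is the predictor, that the isotonic fit is non-decreasing and obtained by pool-adjacent-violators, and that $R^2 = 1 - SS_{res}/SS_{tot}$. The paper's exact estimator may differ, and all function names are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def isotonic_fit(x, y):
    """Non-decreasing least-squares fit of y against x (pool adjacent violators).

    Simplification: observations with tied x values are not pooled first.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    blocks = []  # each block is [sum of y, count]
    for i in order:
        blocks.append([y[i], 1])
        # Merge backwards while the previous block mean exceeds the last one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted_sorted = []
    for s, c in blocks:
        fitted_sorted.extend([s / c] * c)
    fitted = [0.0] * len(x)
    for pos, i in enumerate(order):
        fitted[i] = fitted_sorted[pos]
    return fitted


def isotonic_r2(x, y):
    """Share of variance in y explained by the monotone fit on x."""
    fitted = isotonic_fit(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, fitted))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


# Toy data only: hypothetical probe scores and self-reports.
probe = [1.0, 2.0, 3.0, 4.0]
report = [1.0, 3.0, 2.0, 4.0]
print(spearman_rho(probe, report), isotonic_r2(probe, report))
```

The two measures capture different things: $\rho$ depends only on rank agreement, whereas isotonic $R^2$ measures how much of the variance in the self-report is captured by a monotone function of the probe score. This is why the captions can report them side by side without one implying the other.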