Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations

Li Ji-An, Hua-Dong Xiong, Robert C. Wilson, Marcelo G. Mattar, Marcus K. Benna

TL;DR

The paper introduces a neuroscience-inspired neurofeedback paradigm to quantify metacognition in large language models, specifically their ability to report and control internal activations. By defining neurofeedback labels from targeted axes in the residual stream and using in-context learning, the authors demonstrate that LLMs can monitor and influence a restricted, low-dimensional metacognitive space. The study reveals that metacognitive reporting is facilitated by semantic interpretability and variance explained by the axis, while control is stronger for interpretable or high-variance directions and improves with more context and model size. These findings have implications for AI safety, including potential evasion of neural-based detectors, and motivate strategies to bolster oversight through subspace-aware monitoring and diversified safety classifiers.

Abstract

Large language models (LLMs) can sometimes report the strategies they actually use to solve tasks, yet at other times seem unable to recognize those strategies that govern their behavior. This suggests a limited degree of metacognition - the capacity to monitor one's own cognitive processes for subsequent reporting and self-control. Metacognition enhances LLMs' capabilities in solving complex tasks but also raises safety concerns, as models may obfuscate their internal processes to evade neural-activation-based oversight (e.g., safety detector). Given society's increased reliance on these models, it is critical that we understand their metacognitive abilities. To address this, we introduce a neuroscience-inspired neurofeedback paradigm that uses in-context learning to quantify metacognitive abilities of LLMs to report and control their activation patterns. We demonstrate that their abilities depend on several factors: the number of in-context examples provided, the semantic interpretability of the neural activation direction (to be reported/controlled), and the variance explained by that direction. These directions span a "metacognitive space" with dimensionality much lower than the model's neural space, suggesting LLMs can monitor only a small subset of their neural activations. Our paradigm provides empirical evidence to quantify metacognition in LLMs, with significant implications for AI safety (e.g., adversarial attack and defense).

Paper Structure

This paper contains 41 sections, 5 equations, 27 figures, 1 table.

Figures (27)

  • Figure 1: The neurofeedback paradigm applied to (a-b) neuroscience experiments (e.g., fear modulation), and its adaptation for (c-d) LLMs (e.g., morality processing). (a) Neuroscience neurofeedback technique. In each turn, the subject's neural activity (blue) in response to a stimulus is recorded, processed (green) into a scalar, and presented back to the subject in real-time as a feedback signal (red). The subject's task is to modulate (e.g., increase or decrease) this signal. (b) Neuroscience neurofeedback experiment. Baseline neural activity is recorded as subjects passively observe stimuli (e.g., images of scary spiders). In control trials, subjects use any unspecified mental strategies (e.g., imagining lovely spiders) to volitionally modulate their neural activity with the goal of altering the feedback signal. (c) LLM neurofeedback technique. In each turn, the LLM processes an input sentence. Then, the internal activations from the LLM's hidden states (blue) of this input sentence (trapezoids) are extracted. These high-dimensional activations are then averaged across tokens (green), projected onto a predefined direction (red), and binned into a discrete label (red) that is fed back as input. Light blue rectangles denote self-attention layers; ellipses ("...") denote preceding tokens and neural activations. (d) LLM neurofeedback experiment. The experiment is a multi-turn dialogue between a "user" and an "assistant." An initial prompt provides $N$ in-context examples (a sentence sampled from a dataset, paired with a neurofeedback label generated as in (c)). The LLM is then asked to perform one of three tasks. In the reporting task, the LLM is given a new sentence and has to predict the corresponding label. In the explicit control task, the LLM is given a specified label and has to generate a new sentence that elicits internal activations corresponding to that label. In the implicit control task, the LLM is given a label and a sentence and has to shift its internal activations towards the target label. Throughout the figure, white background indicates content pre-specified by experiment settings, and gray background denotes content generated by human subjects or LLMs (e.g., output tokens, neural activations).
  • Figure 2: Metacognitive reporting task, where LLMs are evaluated on ETHICS and tasked to classify new sentences. (a) Proportion of neural activation variance explained by each principal component (PC) axis (vertical dashed line) and the logistic regression (LR) axis (red cross) used in the reporting task. All axes are computed within each layer, with the proportion of variance explained averaged across layers. (b) Overlaps between the LR axis and most PC axes are modest to zero. (c) Task performance (averaged across all layers) of reporting the labels derived from each PC axis or the LR axis, as a function of the number of in-context examples. Left: reporting accuracy; right: cross-entropy between reported and ground-truth labels. Shaded areas indicate SEM.
  • Figure 3: Explicit control task, where LLMs are evaluated on ETHICS. (a-c) Results for prompts derived from layer 16 of LLaMA3.1 8B (with 32 layers). B = billion parameters. (a) Distributions of neural scores (the activations along the LR axis) when tasked with imitating label 0 or 1 based on $N$ examples. $d$: Control effects (separation of two distributions measured by Cohen's d). (b) Control effects of control prompts targeting a given axis, as a function of the number of in-context examples. (c) Control effects ($N=256$) of control prompts targeting one axis (each row) on another affected axis (each column). $d$ in each row is averaged over all prompts targeting the same axis. (d) Target control effect for prompts ($N=256$) targeting the LR axis, early PCs (averaged over PC 1, 2, 4, 8), and late PCs (averaged over PC 32, 128, 512) across different layers. Shaded areas indicate the $95\%$ confidence interval.
  • Figure 4: Implicit control task (LLMs evaluated on ETHICS). Captions are the same as in Fig. 3.
  • Figure 5: Target control effects on the LR axis across models and layers, where LLMs are evaluated on ETHICS. (a) Target control effects (measured by Cohen's $d$) on the LR axis generally increase with both relative layer depth and model size. Left: explicit control; right: implicit control. Shaded areas indicate the 95% confidence interval. (b) In explicit control, LLaMA-3.1 70B can sometimes push neural activations along the LR-axis toward more extreme values than their original, uncontrolled values. B = billion parameters.
  • ...and 22 more figures
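
The label pipeline described in the Figure 1(c) caption (average activations across tokens, project onto a predefined direction, bin into a discrete label) and the control-effect measure named in the Figure 3 caption (Cohen's d between two score distributions) can be illustrated with a minimal sketch. This is not the authors' code. The function names `neurofeedback_label` and `cohens_d`, the binary binning at a fixed `threshold`, and the pooled-standard-deviation form of Cohen's d are all assumptions made for illustration; the captions do not state the exact binning rule or the variant of d that was used.

```python
import numpy as np


def neurofeedback_label(hidden_states, direction, threshold):
    """Turn one sentence's activations into a (score, label) pair.

    hidden_states: array of shape (num_tokens, d_model), one layer's activations.
    direction:     array of shape (d_model,), the predefined axis (e.g. an LR or PC axis).
    threshold:     scalar cut-off for the binary label (ASSUMPTION: the binning
                   rule is hypothetical; the caption only says "binned").
    """
    hidden_states = np.asarray(hidden_states, dtype=float)
    direction = np.asarray(direction, dtype=float)

    sentence_vec = hidden_states.mean(axis=0)        # average across tokens
    unit_axis = direction / np.linalg.norm(direction)  # normalise the axis
    score = float(sentence_vec @ unit_axis)          # scalar projection
    label = int(score > threshold)                   # discrete (binary) label
    return score, label


def cohens_d(scores_label1, scores_label0):
    """Separation between two score distributions.

    ASSUMPTION: standard pooled-standard-deviation Cohen's d; the sign
    convention (label 1 minus label 0) is also chosen here for illustration.
    """
    a = np.asarray(scores_label1, dtype=float)
    b = np.asarray(scores_label0, dtype=float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (
        len(a) + len(b) - 2
    )
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))
```

With these definitions, a two-token sentence with activations `[[1, 0], [3, 0]]`, axis `[2, 0]`, and threshold `1.5` averages to `[2, 0]` and projects to a score of `2.0`, which falls in bin 1. A larger absolute value of `cohens_d` between scores produced under "imitate label 1" and "imitate label 0" prompts would correspond to a stronger control effect in the sense of Figure 3.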