Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations

Li Ji-An; Hua-Dong Xiong; Robert C. Wilson; Marcelo G. Mattar; Marcus K. Benna

TL;DR

The paper introduces a neuroscience-inspired neurofeedback paradigm to quantify metacognition in large language models, specifically their ability to report and control internal activations. By defining neurofeedback labels from targeted axes in the residual stream and using in-context learning, the authors demonstrate that LLMs can monitor and influence a restricted, low-dimensional metacognitive space. The study reveals that metacognitive reporting is facilitated by semantic interpretability and variance explained by the axis, while control is stronger for interpretable or high-variance directions and improves with more context and model size. These findings have implications for AI safety, including potential evasion of neural-based detectors, and motivate strategies to bolster oversight through subspace-aware monitoring and diversified safety classifiers.

Abstract

Large language models (LLMs) can sometimes report the strategies they actually use to solve tasks, yet at other times seem unable to recognize the strategies that govern their behavior. This suggests a limited degree of metacognition: the capacity to monitor one's own cognitive processes for subsequent reporting and self-control. Metacognition enhances LLMs' capabilities in solving complex tasks but also raises safety concerns, as models may obfuscate their internal processes to evade neural-activation-based oversight (e.g., a safety detector). Given society's increased reliance on these models, it is critical that we understand their metacognitive abilities. To address this, we introduce a neuroscience-inspired neurofeedback paradigm that uses in-context learning to quantify the metacognitive abilities of LLMs to report and control their activation patterns. We demonstrate that these abilities depend on several factors: the number of in-context examples provided, the semantic interpretability of the neural activation direction (to be reported or controlled), and the variance explained by that direction. These directions span a "metacognitive space" with dimensionality much lower than that of the model's neural space, suggesting LLMs can monitor only a small subset of their neural activations. Our paradigm provides empirical evidence to quantify metacognition in LLMs, with significant implications for AI safety (e.g., adversarial attack and defense).
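To make the setup concrete, the following is a minimal sketch of how neurofeedback labels might be derived from a chosen activation direction and turned into an in-context prompt. It is an illustration under stated assumptions, not the authors' implementation: the median threshold for binarizing labels, the "Sentence/Label" prompt layout, and the function names `neurofeedback_labels`, `variance_explained`, and `build_prompt` are all hypothetical, and the activations are synthetic random numbers rather than real residual-stream activations.

```python
import numpy as np


def neurofeedback_labels(activations, axis):
    """Project activations onto a unit-normalized axis and binarize.

    Assumption: labels are 1 when the projection exceeds the median score.
    """
    unit = axis / np.linalg.norm(axis)
    scores = activations @ unit
    labels = (scores > np.median(scores)).astype(int)
    return scores, labels


def variance_explained(activations, axis):
    """Fraction of total activation variance that lies along the axis."""
    unit = axis / np.linalg.norm(axis)
    centered = activations - activations.mean(axis=0)
    along = np.var(centered @ unit)
    total = np.var(centered, axis=0).sum()
    return along / total


def build_prompt(sentences, labels, query):
    """Format labeled examples plus one unlabeled query (hypothetical layout)."""
    blocks = [f"Sentence: {s}\nLabel: {y}" for s, y in zip(sentences, labels)]
    blocks.append(f"Sentence: {query}\nLabel:")
    return "\n\n".join(blocks)


# Synthetic stand-in for per-sentence activations (8 sentences, 16 dimensions).
rng = np.random.default_rng(0)
acts = rng.normal(size=(8, 16))
axis = rng.normal(size=16)

scores, labels = neurofeedback_labels(acts, axis)
sentences = [f"example sentence {i}" for i in range(8)]
prompt = build_prompt(sentences[:-1], labels[:-1], sentences[-1])
print(prompt)
print("variance explained:", variance_explained(acts, axis))
```

In this sketch, the number of labeled examples passed to `build_prompt` plays the role of the in-context example count, and `variance_explained` corresponds to the variance explained by the target direction, two of the factors the abstract names.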
