Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct

Christopher Ackerman, Nina Panickssery

TL;DR

This study demonstrates that the Llama3-8b-Instruct chat model can distinguish its own text from human-written text in several contexts, while the base model cannot, pointing to exposure to its own outputs during post-training as a key factor. By deriving a residual-stream self-recognition vector through contrastive analysis, the authors show that this vector is causally linked to self-authorship judgments and can be used to steer the model to claim or deny authorship. They further show that applying the vector to input tokens colors the model’s perception of a text, leading it to believe or disbelieve that it wrote that text. The findings have AI-safety implications, including potential uses for defense against jailbreaks and for warning systems, and raise the question of whether self-recognition generalizes to larger models.

Abstract

It has been reported that LLMs can recognize their own writing. As this has potential implications for AI safety, yet is relatively understudied, we investigate the phenomenon, seeking to establish whether it robustly occurs at the behavioral level, how the observed behavior is achieved, and whether it can be controlled. First, we find that the Llama3-8b-Instruct chat model - but not the base Llama3-8b model - can reliably distinguish its own outputs from those of humans, and present evidence that the chat model is likely using its experience with its own outputs, acquired during post-training, to succeed at the writing recognition task. Second, we identify a vector in the residual stream of the model that is differentially activated when the model makes a correct self-written-text recognition judgment, show that the vector activates in response to information relevant to self-authorship, present evidence that the vector is related to the concept of "self" in the model, and demonstrate that the vector is causally related to the model's ability to perceive and assert self-authorship. Finally, we show that the vector can be used to control both the model's behavior and its perception, steering the model to claim or disclaim authorship by applying the vector to the model's output as it generates it, and steering the model to believe or disbelieve it wrote arbitrary texts by applying the vector to them as the model reads them.
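The three vector operations named in the abstract (deriving a vector by contrast, steering with it, and removing it) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the vector is a normalized difference of mean activations between two groups of examples, and the function names (`contrastive_vector`, `steer`, `project_out`) and the toy three-dimensional activations are hypothetical stand-ins for real residual-stream activations.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def mean_vector(rows):
    """Column-wise mean of a list of activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def contrastive_vector(acts_self, acts_other):
    """Assumed recipe: unit-norm difference of the two group means."""
    mu_self = mean_vector(acts_self)
    mu_other = mean_vector(acts_other)
    diff = [s - o for s, o in zip(mu_self, mu_other)]
    norm = math.sqrt(dot(diff, diff))
    return [x / norm for x in diff]

def steer(hidden, v, multiplier):
    """Add multiplier * v to one activation (negative multiplier steers away)."""
    return [h + multiplier * x for h, x in zip(hidden, v)]

def project_out(hidden, v):
    """Remove the component of an activation along the unit vector v."""
    coeff = dot(hidden, v)
    return [h - coeff * x for h, x in zip(hidden, v)]

# Toy activations (hypothetical): "self" examples sit high on the first axis.
acts_self = [[2.0, 0.1, 0.0], [1.8, -0.1, 0.2]]
acts_other = [[-2.0, 0.0, 0.1], [-1.8, 0.0, 0.1]]
v = contrastive_vector(acts_self, acts_other)

hidden = [0.5, 1.0, -1.0]
steered = steer(hidden, v, 10.0)   # pushed along v
ablated = project_out(hidden, v)   # no remaining component along v
```

In a real model the same arithmetic would be applied to activations at a chosen layer, either on tokens the model is generating or on tokens it is reading; the layer and multiplier are experimental choices rather than part of this sketch.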

Paper Structure

This paper contains 30 sections, 15 figures, and 6 tables.

Figures (15)

  • Figure 1: Llama3-8b-Instruct self-recognition accuracy in the Paired presentation paradigm, with and without length normalization.
  • Figure 2: Llama3-8b-base self-recognition accuracy in the Paired presentation paradigm, normalized texts.
  • Figure 3: Steering effectiveness by layer and multiplier for Individual presentation paradigm test set 1. +/-Vector: positive/negative steering. Black dots show the unsteered model. Colors indicate multipliers; for example, as the upper right shows, positive steering with multiplier 10 at layer 16 led the model to claim authorship of a text that it did not write $\sim$80% of the time, compared with $\sim$35% for the unsteered model.
  • Figure 4: Aggregate steering effectiveness by layer and multiplier in two different datasets (left and right). A value of 100 indicates complete steering effectiveness in the intended direction. Values below 0 mean that the steered model was less likely to claim (for positive steering) or deny (for negative steering) authorship than the unsteered model, and generally indicate degenerate output.
  • Figure 5: Effect of projecting the self-recognition vector out of the output token on three different datasets. In each case, zeroing out the vector at layer 16 reduces the probability that the model claims self-authorship (irrespective of true authorship) from $\sim$50% to under 30%.
  • ...and 10 more figures
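
The Figure 4 caption describes a scale on which 100 means complete steering in the intended direction and negative values mean the steered model moved the wrong way. The exact formula is not given in this excerpt, so the sketch below is only one plausible reading: the shift in the rate of authorship claims, expressed as a percentage of the room available above (or below) the unsteered baseline. The function name `steering_effectiveness` and its parameters are hypothetical.

```python
def steering_effectiveness(baseline_rate, steered_rate, positive=True):
    """Assumed normalization, not taken from the paper.

    baseline_rate: fraction of texts the unsteered model claims as its own.
    steered_rate:  same fraction for the steered model.
    positive:      True when steering toward claiming authorship,
                   False when steering toward denying it.
    """
    if positive:
        headroom = 1.0 - baseline_rate       # room left to raise the claim rate
        shift = steered_rate - baseline_rate
    else:
        headroom = baseline_rate             # room left to lower the claim rate
        shift = baseline_rate - steered_rate
    if headroom <= 0:
        raise ValueError("baseline leaves no room to steer in this direction")
    return 100.0 * shift / headroom
```

Under this reading, a steered model that matches the unsteered one scores 0, and one that moves against the intended direction scores below 0, which agrees with the caption's description of negative values.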