Extracting Paragraphs from LLM Token Activations

Nicholas Pochinkov, Angelo Benoit, Lovkush Agarwal, Zainab Ali Majid, Lucile Ter-Minassian

TL;DR

We demonstrate that patching single-token activations can transfer significant information about the context of the following paragraph, which offers further insight into the model's capacity to plan ahead.

Abstract

Generative large language models (LLMs) excel in natural language processing tasks, yet their inner workings remain underexplored beyond token-level predictions. This study investigates the degree to which these models decide the content of a paragraph at its onset, shedding light on their contextual understanding. By examining the information encoded in single-token activations, specifically the "\n\n" double newline token, we demonstrate that patching these activations can transfer significant information about the context of the following paragraph, providing further insights into the model's capacity to plan ahead.

Paper Structure

This paper contains 8 sections, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: (Left): Heat map of the average attention weights around the topic change. (Right): Cosine similarity between attention activations. Results averaged over 1,000 model-generated original contexts, sharing a common structure.
  • Figure 2: Diagram describing our approach. After collecting activations at the transition token on the original context model, we transfer these to all layers of the neutrally-prompted model.
  • Figure 3: Context similarity visualised with t-SNE. Results over 1,000 original contexts.
  • Figure 4: Distribution of cosine distances to the original generation. Contexts are summarized using sentence transformers, and distributions are taken over 1,000 original contexts.
  • Figure 5: Cosine similarity between attention activations across all 42 layers of the model. Results averaged over 1,000 model-generated original contexts, sharing a common structure.
  • ...and 1 more figure
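
The patching procedure described in the Figure 2 caption can be illustrated with a minimal toy sketch. This is not the authors' code or model: the three-layer "model", its weights (`LAYER_WEIGHTS`), the embedding vectors, and the helper names (`layer`, `run`, `cosine`) are all hypothetical. The toy is deliberately built so that each position reads only from the position directly before it, which means the token after the paragraph break depends on the earlier context only through the break token. In that idealised setting, overwriting the break-token activations at every layer of a neutrally prompted run reproduces the original-context state exactly; a real LLM would be expected to show only partial transfer.

```python
import math

N_LAYERS = 3
LAYER_WEIGHTS = [0.5, -0.8, 1.1]  # hypothetical per-layer mixing weights


def layer(states, weight):
    """One toy layer: each position adds a nonlinear read of the previous position."""
    out = []
    for i, h in enumerate(states):
        prev = states[i - 1] if i > 0 else [0.0] * len(h)
        out.append([x + math.tanh(weight * p) for x, p in zip(h, prev)])
    return out


def run(embeddings, patch_pos=None, patch_states=None):
    """Return the hidden states entering each layer, followed by the final states.

    If patch_pos is given, the state at that position is overwritten with
    patch_states[l] before layer l, and once more after the last layer.
    """
    states = [list(e) for e in embeddings]
    trace = []
    for l in range(N_LAYERS + 1):
        if patch_pos is not None:
            states[patch_pos] = list(patch_states[l])
        trace.append([list(s) for s in states])
        if l < N_LAYERS:
            states = layer(states, LAYER_WEIGHTS[l])
    return trace


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings; BREAK stands in for the "\n\n" transition token.
TOPIC = [1.0, 0.0, 0.5]
NEUTRAL = [0.0, 1.0, -0.5]
BREAK = [0.2, 0.2, 0.2]
NEXT = [0.3, -0.1, 0.4]

original_context = [TOPIC, TOPIC, BREAK, NEXT]  # break token at index 2
neutral_context = [NEUTRAL, BREAK, NEXT]        # break token at index 1

# 1. Collect the break-token activations at every layer of the original run.
original_trace = run(original_context)
break_states = [layer_states[2] for layer_states in original_trace]

# 2. Run the neutral prompt with and without patching those activations in.
target = original_trace[-1][-1]
unpatched = run(neutral_context)[-1][-1]
patched = run(neutral_context, patch_pos=1, patch_states=break_states)[-1][-1]

print("cosine(unpatched, target):", cosine(unpatched, target))
print("cosine(patched, target):  ", cosine(patched, target))
```

The comparison at the end mirrors, in spirit only, the cosine-based evaluation mentioned in the Figure 4 caption; the paper compares sentence-transformer summaries of generated text, not raw hidden states as done here.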