Beyond Tokens in Language Models: Interpreting Activations through Text Genre Chunks

Éloïse Benito-Rodriguez, Einar Urdshals, Jasmina Nasufi, Nicky Pochinkov

TL;DR

This work addresses the interpretability of large language models by predicting text genre from activations at the chunk level, rather than single-token outputs. Using Mistral-7B-Instruct-v2 and the CORE corpus, the authors train shallow probes on layer-wise activations to recover chunk category labels, achieving high macro F1 scores (up to $0.98$ on synthetic data and $0.71$ on CORE) and demonstrating meaningful, though dataset-dependent, structure in the residual stream. Dimensionality reduction via PHATE and embedding analyses reveal partial cluster-genre alignment, supporting the hypothesis that high-level multi-token text structures are encoded in transformer activations. The results establish a proof of concept for chunk-scale interpretability and lay groundwork for broader monitoring and trust-enhancement tools for LLMs, while outlining limitations and avenues for future work.

Abstract

Understanding Large Language Models (LLMs) is key to ensuring their safe and beneficial deployment. This task is complicated by the difficulty of interpreting LLM structures and by the impossibility of having all their outputs evaluated by humans. In this paper, we present the first step towards a predictive framework in which the genre of a text used to prompt an LLM is predicted from the model's activations. Using Mistral-7B and two datasets, we show that genre can be extracted with F1-scores of up to 98% on the synthetic dataset and 71% on CORE using scikit-learn classifiers. Across both datasets, results consistently outperform the control task, providing a proof of concept that text genres can be inferred from LLM activations with shallow learning models.
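To make the probing setup concrete, here is a minimal sketch of a shallow scikit-learn probe scored with macro F1 and compared against a control. Everything in it is an assumption for illustration: the activations are synthetic Gaussian clusters rather than Mistral-7B outputs, the helper name `probe_macro_f1` is hypothetical, logistic regression stands in for whichever classifiers the authors used, and the control is implemented as label shuffling, which may differ from the control task in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-in for chunk activations: 3 "genres", 64-dim vectors.
# Real activations would come from a layer of the language model instead.
n_per_class, n_classes, dim = 200, 3, 64
centers = rng.normal(scale=2.0, size=(n_classes, dim))
X = np.vstack(
    [centers[k] + rng.normal(size=(n_per_class, dim)) for k in range(n_classes)]
)
y = np.repeat(np.arange(n_classes), n_per_class)


def probe_macro_f1(features, labels, seed=0):
    """Fit a linear probe on a train split and return macro F1 on the test split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.25, random_state=seed, stratify=labels
    )
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_tr, y_tr)
    return f1_score(y_te, clf.predict(X_te), average="macro")


# Probe on the true labels versus an assumed control with shuffled labels.
real_f1 = probe_macro_f1(X, y)
control_f1 = probe_macro_f1(X, rng.permutation(y))
print(f"probe macro F1:   {real_f1:.2f}")
print(f"control macro F1: {control_f1:.2f}")
```

The gap between the two scores is the quantity of interest: a probe that only memorises noise should do no better on the true labels than on the shuffled ones.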

Paper Structure

This paper contains 16 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Generation of the labelled dataset.
  • Figure 2: Training procedure for the prediction models that classify the category of a text section. The activation $a_i^j$ of chunk $c_i$ at the $j$-th layer is extracted from Mistral-7B.
  • Figure 3: The PHATE dimensionality reduction for the synthetic dataset. We observe that there is some correspondence between the clusters and the labelled categories.
  • Figure 4: The PHATE dimensionality reduction for the CORE dataset. We see that there is a lot of overlap between clusters and the labelled categories.
  • Figure 5: F1-score performance as a function of the layer fraction for prediction models on the task of predicting the category of a text section. The activation $a_i^j$ of chunk $c_i$ at the $j$-th layer has been extracted from Mistral-7B.
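
The layer-wise evaluation described in the captions for Figures 2 and 5 can be sketched as a loop that fits one probe per layer and records macro F1 against the layer fraction. This is a hedged illustration, not the authors' pipeline: the array `activations[j, i]` plays the role of $a_i^j$ but is filled with synthetic data, the helper `layerwise_macro_f1` is hypothetical, the layer fraction is assumed to be $(j+1)/n_{\text{layers}}$, and the class signal is scaled with depth only so that the resulting curve is not flat.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_layers, n_chunks, dim, n_classes = 4, 300, 32, 3

# Synthetic chunk labels and per-layer activations (assumption, not model data).
labels = rng.integers(0, n_classes, size=n_chunks)
centers = rng.normal(size=(n_classes, dim))
activations = np.stack(
    [
        (0.5 * j) * centers[labels] + rng.normal(size=(n_chunks, dim))
        for j in range(n_layers)
    ]
)  # shape: (n_layers, n_chunks, dim)


def layerwise_macro_f1(acts, y, seed=0):
    """Return a list of (layer_fraction, macro_f1) pairs, one probe per layer."""
    # Use the same train/test chunks at every layer so scores are comparable.
    idx_tr, idx_te = train_test_split(
        np.arange(len(y)), test_size=0.25, random_state=seed, stratify=y
    )
    curve = []
    for j in range(acts.shape[0]):
        clf = LogisticRegression(max_iter=1000).fit(acts[j, idx_tr], y[idx_tr])
        score = f1_score(y[idx_te], clf.predict(acts[j, idx_te]), average="macro")
        curve.append(((j + 1) / acts.shape[0], score))
    return curve


curve = layerwise_macro_f1(activations, labels)
for fraction, score in curve:
    print(f"layer fraction {fraction:.2f}: macro F1 = {score:.2f}")
```

Holding the train/test split fixed across layers means differences along the curve reflect the layer's representation rather than a different sample of chunks.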