Beyond Tokens in Language Models: Interpreting Activations through Text Genre Chunks
Éloïse Benito-Rodriguez, Einar Urdshals, Jasmina Nasufi, Nicky Pochinkov
TL;DR
This work addresses the interpretability of large language models by predicting text genre from activations at the chunk level, rather than single-token outputs. Using Mistral-7B-Instruct-v2 and the CORE corpus, the authors train shallow probes on layer-wise activations to recover chunk category labels, achieving high macro F1 scores (up to $0.98$ on synthetic data and $0.71$ on CORE) and demonstrating meaningful, though dataset-dependent, structure in the residual stream. Dimensionality reduction via PHATE and embedding analyses reveal partial cluster-genre alignment, supporting the hypothesis that high-level multi-token text structures are encoded in transformer activations. The results establish a proof of concept for chunk-scale interpretability and lay groundwork for broader monitoring and trust-enhancement tools for LLMs, while outlining limitations and avenues for future work.
Abstract
Understanding Large Language Models (LLMs) is key to ensuring their safe and beneficial deployment. This task is complicated by the difficulty of interpreting LLM structures and by the impossibility of having all their outputs evaluated by humans. In this paper, we present the first step towards a predictive framework in which the genre of a text used to prompt an LLM is predicted from the model's activations. Using Mistral-7B and two datasets, we show that genre can be extracted with F1-scores of up to 98% and 71% using scikit-learn classifiers. Across both datasets, results consistently outperform the control task, providing a proof of concept that text genres can be inferred from LLMs with shallow learning models.
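To make the probing setup concrete, here is a minimal sketch of a shallow probe scored with macro F1 and compared against a control. It is an illustration under stated assumptions, not the pipeline used in the paper. The "activations" are hypothetical synthetic Gaussian blobs rather than Mistral-7B activations. A nearest-centroid classifier stands in for the scikit-learn classifiers so that the example needs only the standard library. The control is assumed to be a shuffled-label task. All function names are invented for this example.

```python
import random


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def fit_centroids(X, y):
    """Shallow probe (assumption): one mean vector per genre label."""
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }


def predict(centroids, X):
    """Assign each vector to the label of its nearest centroid."""
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [min(centroids, key=lambda c: dist2(centroids[c], x)) for x in X]


def make_data(rng, n_per_class, dim=8, n_classes=3):
    """Hypothetical stand-in for chunk-level activations: one blob per genre."""
    X, y = [], []
    for c in range(n_classes):
        center = [3.0 if i == c else 0.0 for i in range(dim)]
        for _ in range(n_per_class):
            X.append([m + rng.gauss(0.0, 1.0) for m in center])
            y.append(c)
    return X, y


def run_experiment(seed=0):
    rng = random.Random(seed)
    X_train, y_train = make_data(rng, n_per_class=40)
    X_test, y_test = make_data(rng, n_per_class=20)

    # Probe trained and scored on the true genre labels.
    probe = fit_centroids(X_train, y_train)
    probe_f1 = macro_f1(y_test, predict(probe, X_test))

    # Control (assumption): same vectors, labels shuffled so they carry no signal.
    y_train_ctrl, y_test_ctrl = y_train[:], y_test[:]
    rng.shuffle(y_train_ctrl)
    rng.shuffle(y_test_ctrl)
    control = fit_centroids(X_train, y_train_ctrl)
    control_f1 = macro_f1(y_test_ctrl, predict(control, X_test))
    return probe_f1, control_f1


if __name__ == "__main__":
    probe_f1, control_f1 = run_experiment()
    print(f"probe macro F1:   {probe_f1:.2f}")
    print(f"control macro F1: {control_f1:.2f}")
```

The gap between the two scores is the quantity of interest: a probe that beats the control by a wide margin is picking up structure in the vectors rather than memorizing arbitrary labels. Macro F1 weights each genre equally, which matters when genre frequencies are unbalanced.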
