LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers
Anton Razzhigaev, Matvey Mikhalchuk, Temurbek Rahmatullaev, Elizaveta Goncharova, Polina Druzhinina, Ivan Oseledets, Andrey Kuznetsov
TL;DR
The paper investigates how Transformer-based LLMs encode and retain long-range contextual information, revealing that seemingly trivial tokens can carry outsized contextual signals. It introduces LLM-Microscope, an open-source toolkit that measures token-level nonlinearity, assesses contextual memory via prefix reconstruction, analyzes intermediate-layer contributions with a Logit Lens, and estimates the intrinsic dimensionality of representations. Nonlinearity is quantified by $E_i^l = \| A^* h_i^l - h_i^{l+1} \|_2$, the error of the best single linear map $A^*$ from layer $l$ to layer $l+1$ on token $i$, and contextualization by $C_i = -\log P(w_1, \ldots, w_{i-1} \mid e_i)$, the loss incurred when reconstructing the prefix from the representation $e_i$ of token $i$. The results show a strong correlation between linearity and contextualization, with filler tokens such as punctuation and determiners being highly contextualized. Empirical evaluations on MMLU and BABILong-4k demonstrate that removing high-context tokens degrades performance, even when only the tokens that GPT-4o deems least relevant are deleted, underscoring the hidden importance of such tokens for coherent long-context understanding. The toolkit gives researchers scalable tools for diagnosing context handling and for guiding design choices that improve long-context reasoning in LLMs.
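To make the nonlinearity score concrete, the following is a minimal sketch of how $E_i^l$ could be computed from two consecutive layers' hidden states. It assumes the optimal map $A^*$ is obtained by ordinary least squares over all tokens and skips any centering or normalization that the actual toolkit may apply. The function name `token_nonlinearity` and the row-vector convention are illustrative choices, not the LLM-Microscope API.

```python
import numpy as np

def token_nonlinearity(h_l, h_next):
    """Per-token linear-approximation error between two layers.

    h_l, h_next: arrays of shape (n_tokens, d) holding hidden states
    at layer l and layer l+1 (hypothetical inputs for this sketch).
    Returns (errors, A_t), where errors[i] = ||A* h_i^l - h_i^{l+1}||_2
    and A_t is the fitted map in row-vector form (A_t = A*^T).
    """
    # Least-squares fit: A_t minimizes ||h_l @ A_t - h_next||_F.
    A_t, *_ = np.linalg.lstsq(h_l, h_next, rcond=None)
    # Residual of the linear prediction for every token.
    residual = h_l @ A_t - h_next
    # L2 norm per token (per row).
    return np.linalg.norm(residual, axis=1), A_t
```

With this convention, tokens whose layer-to-layer update is well explained by the shared linear map receive a small error, and a low error corresponds to high linearity.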
Abstract
We introduce methods to quantify how Large Language Models (LLMs) encode and store contextual information, revealing that tokens often seen as minor (e.g., determiners, punctuation) carry surprisingly high context. Notably, removing these tokens, especially stopwords, articles, and commas, consistently degrades performance on MMLU and BABILong-4k, even when only tokens judged irrelevant are removed. Our analysis also shows a strong correlation between contextualization and linearity, where linearity measures how closely the transformation from one layer's embeddings to the next can be approximated by a single linear mapping. These findings underscore the hidden importance of filler tokens in maintaining context. For further exploration, we present LLM-Microscope, an open-source toolkit that assesses token-level nonlinearity, evaluates contextual memory, visualizes intermediate layer contributions (via an adapted Logit Lens), and measures the intrinsic dimensionality of representations. This toolkit illuminates how seemingly trivial tokens can be critical for long-range understanding.
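As a rough illustration of the token-removal ablation, the sketch below strips punctuation and a small set of stopwords from a prompt before it would be passed to a model. The stopword list, the regex tokenization, and the function name `strip_filler` are assumptions made for this example; the abstract does not specify the exact lists or tokenizer used in the experiments.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in"}

def strip_filler(text, drop_stopwords=True, drop_punct=True):
    """Return `text` with selected filler tokens removed."""
    # Split into word tokens and single punctuation characters.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    kept = []
    for tok in tokens:
        is_punct = re.fullmatch(r"[^\w\s]", tok) is not None
        if drop_punct and is_punct:
            continue
        if drop_stopwords and tok.lower() in STOPWORDS:
            continue
        kept.append(tok)
    # Re-join with single spaces; original spacing is not preserved.
    return " ".join(kept)
```

Comparing a model's accuracy on the original prompts against the stripped versions, one removal category at a time, would give the kind of ablation the abstract describes.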
