DMAP: A Distribution Map for Text
Tom Kempton, Julia Rozanova, Parameswaran Kamalaruban, Maeve Madigan, Karolina Wresilo, Yoann L. Launay, David Sutton, Stuart Burrell
TL;DR
DMAP is a model-agnostic method that maps a text, via a language model's next-token probability distributions, to samples in the unit interval, addressing the lack of context that limits perplexity-based analysis. For each position $i$, it constructs an interval $I_i=[a_i,b_i]$ with $a_i=\sum_{v\in V^+_i} p(v|w_{1:i-1})$ and $b_i=a_i+p(w_i|w_{1:i-1})$, then draws a sample $x_i\sim U(I_i)$. An entropy-weighted variant uses $h_i=-\sum_v p(v|w_{1:i-1})\log p(v|w_{1:i-1})$ and $h'_i=\max\{h_i,\lambda\}$ to form $\hat{D}(\underline w)=\frac{\sum_{i=1}^T h'_i \frac{\chi_{I_i}}{|I_i|}}{\sum_{i=1}^T h'_i}$, where $\chi_{I_i}$ is the indicator function of $I_i$, yielding a density on $[0,1]$. When text is pure-sampled from the same model used to evaluate it, the DMAP samples are i.i.d. Uniform$(0,1)$, so standard $\chi^2$ tests on binned histograms can be used to assess generation integrity. The authors present three case studies: validation of generation parameters, insights into probability-curvature-based detectors of machine-generated text, and forensic analysis of post-training data signatures in instruction-tuned models, all with an emphasis on computational efficiency and visualization. DMAP thus offers a unified statistical lens for comparing texts across models and generation settings, with potential applications in data curation and calibration of aligned models.
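The construction above can be illustrated with a minimal sketch in pure Python. It is not the authors' implementation. It assumes that $V^+_i$ is the set of tokens with higher probability than the observed token $w_i$ (ties broken by token index), which the summary does not state explicitly. The function names, the five-token toy distribution, and the bin count are all hypothetical choices made for illustration.

```python
import math
import random


def dmap_interval(probs, token):
    """Return (a, b) for one position.

    Assumption: V^+ is the set of tokens ranked above `token`, i.e. with
    strictly higher probability, ties broken by lower token index.
    """
    p_tok = probs[token]
    a = sum(
        p for v, p in enumerate(probs)
        if p > p_tok or (p == p_tok and v < token)
    )
    return a, a + p_tok


def dmap_samples(dists, tokens, rng):
    """Draw one x_i ~ U(I_i) per position."""
    xs = []
    for probs, tok in zip(dists, tokens):
        a, b = dmap_interval(probs, tok)
        xs.append(rng.uniform(a, b))
    return xs


def entropy(probs):
    """Shannon entropy h_i of one next-token distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def weighted_density(x, intervals, entropies, lam):
    """Evaluate the entropy-weighted density at x, with h'_i = max(h_i, lam)."""
    weights = [max(h, lam) for h in entropies]
    mass = sum(
        w / (b - a)
        for w, (a, b) in zip(weights, intervals)
        if a <= x < b
    )
    return mass / sum(weights)


def chi2_statistic(xs, bins=10):
    """Pearson chi-squared statistic of xs against Uniform(0, 1)."""
    counts = [0] * bins
    for x in xs:
        counts[min(int(x * bins), bins - 1)] += 1
    expected = len(xs) / bins
    return sum((c - expected) ** 2 / expected for c in counts)


# Toy stand-in for a language model: the same 5-token distribution at
# every position (a real model would condition on the prefix).
rng = random.Random(0)
toy_probs = [0.5, 0.2, 0.15, 0.1, 0.05]
tokens = rng.choices(range(5), weights=toy_probs, k=2000)  # pure sampling
dists = [toy_probs] * len(tokens)

xs = dmap_samples(dists, tokens, rng)
intervals = [dmap_interval(p, t) for p, t in zip(dists, tokens)]
entropies = [entropy(p) for p in dists]

print("chi2 statistic:", chi2_statistic(xs))
print("weighted density at 0.25:", weighted_density(0.25, intervals, entropies, 0.1))
```

With 10 bins the statistic has 9 degrees of freedom, so under pure sampling it should exceed the 5% critical value of about 16.92 only rarely. Sampling the tokens with a modified distribution (for example, a lower temperature) while scoring with `toy_probs` would be expected to push mass toward 0 and inflate the statistic.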
Abstract
Large Language Models (LLMs) are a powerful tool for statistical text analysis, with derived sequences of next-token probability distributions offering a wealth of information. Extracting this signal typically relies on metrics such as perplexity, which do not adequately account for context; how one should interpret a given next-token probability is dependent on the number of reasonable choices encoded by the shape of the conditional distribution. In this work, we present DMAP, a mathematically grounded method that maps a text, via a language model, to a set of samples in the unit interval that jointly encode rank and probability information. This representation enables efficient, model-agnostic analysis and supports a range of applications. We illustrate its utility through three case studies: (i) validation of generation parameters to ensure data integrity, (ii) examining the role of probability curvature in machine-generated text detection, and (iii) a forensic analysis revealing statistical fingerprints left in downstream models that have been subject to post-training on synthetic data. Our results demonstrate that DMAP offers a unified statistical view of text that is simple to compute on consumer hardware, widely applicable, and provides a foundation for further research into text analysis with LLMs.
