DMAP: A Distribution Map for Text

Tom Kempton, Julia Rozanova, Parameswaran Kamalaruban, Maeve Madigan, Karolina Wresilo, Yoann L. Launay, David Sutton, Stuart Burrell

TL;DR

DMAP provides a principled, model-agnostic framework for mapping next-token probability distributions to the unit interval, solving the contextualization problem that plagues perplexity-based analysis. For each position $i$, it constructs an interval $I_i=[a_i,b_i]$ with $a_i=\sum_{v\in V^+_i} p(v|w_{1:i-1})$ and $b_i=a_i+p(w_i|w_{1:i-1})$, then samples $x_i\sim U(I_i)$. An entropy-weighted variant uses $h_i=-\sum_v p(v|w_{1:i-1})\log p(v|w_{1:i-1})$ and $h'_i=\max\{h_i,\lambda\}$ to form $\hat{D}(\underline w)=\frac{\sum_{i=1}^T h'_i \frac{\chi_{I_i}}{|I_i|}}{\sum_{i=1}^T h'_i}$, yielding a density on $[0,1]$. When text is pure-sampled, the DMAP samples are i.i.d. Uniform$(0,1)$, enabling standard $\chi^2$-based tests on binned histograms to assess generation integrity. The authors demonstrate three case studies: parameter validation, probability-curvature-based detector design insights, and forensic analysis of post-training data signatures in instruction-tuned models, all with an emphasis on computational efficiency and visualization. DMAP thus provides a unified statistical lens for comparing texts across models and generation settings, with potential applications in data curation and calibration of aligned models.
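The interval construction and the entropy-weighted density can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not define $V^+_i$, so the sketch assumes it is the set of tokens ranked strictly above $w_i$ when the vocabulary is sorted by descending probability (ties broken by token id). The function names and the toy distributions are hypothetical, and the observed token is assumed to have nonzero probability.

```python
import math
import random

def dmap_interval(probs, token):
    """Return (a_i, b_i) for the observed token under one next-token distribution.

    ASSUMPTION: V^+ is the set of tokens ranked strictly above `token` by
    descending probability, with ties broken by lower token id.
    """
    p_tok = probs[token]
    a = sum(p for v, p in enumerate(probs)
            if p > p_tok or (p == p_tok and v < token))
    return a, a + p_tok

def dmap_samples(dists, tokens, rng):
    """Draw one x_i ~ U(I_i) per position."""
    return [rng.uniform(*dmap_interval(probs, tok))
            for probs, tok in zip(dists, tokens)]

def entropy_weighted_density(dists, tokens, n_bins, lam):
    """Bin-averaged D-hat: position i spreads weight h'_i uniformly over I_i."""
    mass = [0.0] * n_bins
    total = 0.0
    for probs, tok in zip(dists, tokens):
        a, b = dmap_interval(probs, tok)
        h = -sum(p * math.log(p) for p in probs if p > 0)  # entropy h_i
        w = max(h, lam)                                    # h'_i = max{h_i, lambda}
        total += w
        for k in range(n_bins):
            lo, hi = k / n_bins, (k + 1) / n_bins
            overlap = max(0.0, min(b, hi) - max(a, lo))
            mass[k] += w * overlap / (b - a)
    # Convert the mass fraction in each bin to an average density over the bin.
    return [m / total * n_bins for m in mass]

# Toy (hypothetical) example: two positions over a three-token vocabulary.
dists = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]]
tokens = [1, 1]
xs = dmap_samples(dists, tokens, random.Random(0))
density = entropy_weighted_density(dists, tokens, n_bins=10, lam=0.1)
```

In practice the per-position distributions would come from the evaluator model's softmax output; the lists above only stand in for them.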

Abstract

Large Language Models (LLMs) are a powerful tool for statistical text analysis, with derived sequences of next-token probability distributions offering a wealth of information. Extracting this signal typically relies on metrics such as perplexity, which do not adequately account for context; how one should interpret a given next-token probability is dependent on the number of reasonable choices encoded by the shape of the conditional distribution. In this work, we present DMAP, a mathematically grounded method that maps a text, via a language model, to a set of samples in the unit interval that jointly encode rank and probability information. This representation enables efficient, model-agnostic analysis and supports a range of applications. We illustrate its utility through three case studies: (i) validation of generation parameters to ensure data integrity, (ii) examining the role of probability curvature in machine-generated text detection, and (iii) a forensic analysis revealing statistical fingerprints left in downstream models that have been subject to post-training on synthetic data. Our results demonstrate that DMAP offers a unified statistical view of text that is simple to compute on consumer hardware, widely applicable, and provides a foundation for further research into text analysis with LLMs.

Paper Structure (46 sections, 2 theorems, 15 equations, 23 figures, 1 table)

Key Result

Proposition 3.1

When generating a text $w_1\cdots w_T$ by pure sampling from language model $p$, the corresponding sequence $x_1\cdots x_T$ obtained by applying DMAP to $w_1\cdots w_T$ with evaluator model $p$ will be independent and identically distributed (i.i.d.) according to the uniform measure on $[0,1]$.
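A short sketch of why this holds, offered as an illustration rather than the paper's own proof. It assumes that, for a fixed context, the candidate intervals $I_i(v)$ (the interval DMAP would assign if $w_i=v$) partition $[0,1]$ with $|I_i(v)|=p(v|w_{1:i-1})$, which is the case when $V^+_i$ collects the tokens ranked above $w_i$ in a fixed ordering. The notation $I_i(v)$ is introduced here for convenience.

```latex
\begin{align*}
\Pr\bigl(x_i \le t \mid w_{1:i-1}\bigr)
  &= \sum_{v \in V} p(v \mid w_{1:i-1}) \,
     \frac{\lvert I_i(v) \cap [0,t] \rvert}{\lvert I_i(v) \rvert} \\
  &= \sum_{v \in V} \lvert I_i(v) \cap [0,t] \rvert
     \qquad \text{since } \lvert I_i(v) \rvert = p(v \mid w_{1:i-1}) \\
  &= \lvert [0,t] \rvert = t , \qquad t \in [0,1].
\end{align*}
```

The sums run over tokens with nonzero probability. Because the conditional law of $x_i$ is Uniform$(0,1)$ whatever the preceding tokens are, $x_i$ is independent of $x_1\cdots x_{i-1}$, which gives the i.i.d. claim.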

Figures (23)

  • Figure 1: The DMAP algorithm. Given a text $\underline w$ of length $T$, this diagram illustrates how DMAP generates a collection of samples $x_1\cdots x_T$ in $[0, 1]$. These may be analyzed quantitatively or qualitatively by splitting $[0,1]$ into equal-sized bins and plotting a histogram, as illustrated in Figure \ref{fig:comparison}. We initialize $i=1$. Our experiments demonstrate that these visualizations identify decoding parameters (top-$p$, top-$k$, temperature), yield insights into black-box machine-generated text detection algorithms based on probability curvature, and reveal statistical fingerprints left by performing supervised fine-tuning (SFT) on synthetic data.
  • Figure 2: Illustrative DMAP histograms. The first row shows plots of XSum data (Narayan et al., 2018) generated by OPT-125m, evaluated by OPT-125m. The generation strategies (left to right) are (a) pure sampling, (b) top-$p$ = 0.8 sampling, (c) temperature $\tau = 0.8$ sampling, and (d) top-$k$ = 50 sampling. The second row shows different types of text evaluated by DMAP: (e) a news dataset of human text from RAID, (f) text generated by Mistral 7B (Jiang et al., 2023) using pure sampling (top-$p$ = 1), (g) text generated by Mistral 7B Instruct, and (h) text generated by ChatGPT from the Ghostbuster dataset (Verma et al., 2024). OPT-125m was used as the scoring model to generate DMAP samples. See Appendix \ref{sec:furtherplots} for examples with larger evaluation models.
  • Figure 3: Quantitative validation of decoding parameters. (a) shows a DMAP plot for the black-box case of Llama 3.1 8B generated text evaluated by Mistral 7B. (b) shows a DMAP plot for the white-box case of Llama 3.1 8B generated text evaluated by Llama 3.1 8B. (c) plots the $\log_{10}$ p-values resulting from our $\chi^2$ uniformity test. This demonstrates how quantitative evidence can be extracted from DMAP samples to investigate hypotheses about possible generation strategies. For example, (c) tells us that after evaluating $10{,}000$ tokens of Llama-generated text with Mistral 7B as the evaluation model, the probability that Mistral would produce text with such an extreme $\chi^2$ statistic is less than $10^{-10}$. We can conclude that it is not plausible that the text under review was generated by pure sampling from Mistral 7B.
  • Figure 4: Using DMAP to investigate the effect of SFT post-training on synthetic data. DMAP plots generated by pure sampling with the evaluator model (OPT-125m) on text generated by Pythia 1B models with (a) no fine-tuning, (b) fine-tuned on OASST2 human data (Köpf et al., 2023), (c) fine-tuned on OASST2 with responses regenerated by Llama 3.1 8B at temperature 0.7, (d) fine-tuned on OASST2 with responses regenerated by Llama 3.1 8B at temperature 1.0.
  • Figure 5: Empirical convergence analysis of DMAP with $40$-bin histograms. This figure illustrates the convergence of $40$-bin DMAP histograms as the number of tokens, and thus DMAP samples, increases from $200$ to $20{,}000$. At extremely low sample sizes we see high variability and noise, while the strong characteristic shapes we expect emerge as the number of tokens increases. We recommend choosing the number of bins as a function of the number of tokens being evaluated, for example using the Terrell-Scott rule, so as to mitigate this noise.
  • ...and 18 more figures
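
The $\chi^2$ uniformity test from Figure 3 and the bin-count recommendation from Figure 5 can be combined in a short sketch. This is an illustrative reading, not the authors' code: the function names are hypothetical, the Terrell-Scott rule is taken in its usual form $k=\lceil (2n)^{1/3} \rceil$, and the p-value uses the Wilson-Hilferty normal approximation to stay within the standard library. An exact value would come from a chi-square survival function such as `scipy.stats.chi2.sf`.

```python
import math
import random

def terrell_scott_bins(n):
    """Terrell-Scott rule: k = ceil((2n)^(1/3)) bins for n samples."""
    return max(1, math.ceil((2 * n) ** (1 / 3)))

def chi2_uniformity(samples, n_bins=None):
    """Chi-square test of DMAP samples against Uniform(0, 1).

    Returns (statistic, degrees_of_freedom, approximate_p_value).
    ASSUMPTION: the p-value is the Wilson-Hilferty approximation, not exact.
    """
    n = len(samples)
    k = n_bins or terrell_scott_bins(n)
    if k < 2:
        raise ValueError("need at least 2 bins (about 4 or more samples)")
    counts = [0] * k
    for x in samples:
        counts[min(int(x * k), k - 1)] += 1  # clamp x == 1.0 into the last bin
    expected = n / k
    stat = sum((c - expected) ** 2 / expected for c in counts)
    dof = k - 1
    # Wilson-Hilferty: (stat/dof)^(1/3) is approximately normal.
    z = ((stat / dof) ** (1 / 3) - (1 - 2 / (9 * dof))) / math.sqrt(2 / (9 * dof))
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return stat, dof, p_value

# Toy check with synthetic samples (not real DMAP output).
rng = random.Random(0)
uniform_like = [rng.random() for _ in range(5000)]      # mimics pure sampling
skewed = [rng.random() ** 2 for _ in range(5000)]       # mass piled near 0
stat_u, dof_u, p_u = chi2_uniformity(uniform_like)
stat_s, dof_s, p_s = chi2_uniformity(skewed)
```

A very small p-value, as for the skewed samples here, is evidence against the hypothesis that the text was pure-sampled from the evaluator model.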

Theorems & Definitions (3)

  • Proposition 3.1
  • proof
  • Proposition D.1