From Internal Representations to Text Quality: A Geometric Approach to LLM Evaluation

Viacheslav Yusupov, Danil Maksimov, Ameliia Alaeva, Anna Vasileva, Anna Antipina, Tatyana Zaitseva, Alina Ermilova, Evgeny Burnaev, Egor Shvetsov

TL;DR

This work demonstrates that internal geometric properties of LLM representations can serve as reliable, reference-free proxies for text quality. By evaluating six tester models across eight generators, the authors show that Intrinsic Dimensionality, Effective Rank, and Maximum Explainable Variance consistently rank generated text in the same order, correlating with established external metrics like BLEURT and GPT-2 perplexity. The framework applies across English, German, and Russian, including autoregressive and diffusion-based models, suggesting that these geometric cues reflect intrinsic text properties rather than model idiosyncrasies. The results support deploying lightweight, annotation-free evaluation pipelines that leverage internal representations to assess naturalness and quality at scale. This approach promises practical benefits for rapid model development and automated benchmarking in diverse linguistic contexts.

Abstract

This paper bridges internal and external analysis approaches to large language models (LLMs) by demonstrating that geometric properties of internal model representations serve as reliable proxies for evaluating generated text quality. We validate a set of metrics including Maximum Explainable Variance, Effective Rank, Intrinsic Dimensionality, MAUVE score, and Schatten Norms measured across different layers of LLMs, demonstrating that Intrinsic Dimensionality and Effective Rank can serve as universal assessments of text naturalness and quality. Our key finding reveals that different models consistently rank text from various sources in the same order based on these geometric properties, indicating that these metrics reflect inherent text characteristics rather than model-specific artifacts. This allows a reference-free text quality evaluation that does not require human-annotated datasets, offering practical advantages for automated evaluation pipelines.
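
To make two of the metrics named above concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the commonly used definitions: Effective Rank as the exponential of the Shannon entropy of the L1-normalized singular values, and Maximum Explainable Variance as the share of total variance captured by the first principal direction. Mean-centering the token embeddings, the helper names, and the `eps` parameter are assumptions introduced for illustration; the paper's own equations may differ in such details.

```python
import numpy as np


def singular_values(hidden_states: np.ndarray) -> np.ndarray:
    """Singular values of mean-centered embeddings, shape (n_tokens, dim).

    Centering is an assumption of this sketch, not something the abstract states.
    """
    centered = hidden_states - hidden_states.mean(axis=0, keepdims=True)
    return np.linalg.svd(centered, compute_uv=False)


def effective_rank(hidden_states: np.ndarray, eps: float = 1e-12) -> float:
    """exp(entropy) of the singular values normalized to sum to one."""
    s = singular_values(hidden_states)
    p = s / (s.sum() + eps)
    p = p[p > 0]  # drop exact zeros so log() stays finite
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))


def max_explainable_variance(hidden_states: np.ndarray, eps: float = 1e-12) -> float:
    """Fraction of total variance explained by the top principal direction."""
    s2 = singular_values(hidden_states) ** 2
    return float(s2[0] / (s2.sum() + eps))


# Toy input: variance spread evenly over 4 orthogonal directions.
basis = np.eye(4)
isotropic = np.vstack([basis, -basis])
print(effective_rank(isotropic), max_explainable_variance(isotropic))
```

In practice `hidden_states` would be the activations of one layer of a tester model for a given text; evenly spread variance pushes Effective Rank toward the embedding dimension and MEV toward its minimum, while a single dominant direction does the opposite.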

Paper Structure

This paper contains 21 sections, 7 equations, 16 figures, 3 tables.

Figures (16)

  • Figure 1: Rankings of eight generators $\mathcal{G}$ by four tester models $\mathcal{T}$ ranging in size from $0.5$B to $8$B (Qwen2 0.5B, Gemma 2B, Llama3.1 8B Instruct, and the diffusion model LLaDA 8B) under three geometric metrics (Resultant Length, Effective Rank, and CorrInt). The generator rankings are similar across testers.
  • Figure 2: Layer-averaged (left two columns) and layer-wise (right two columns) metrics (MEV, Effective Rank, and CorrInt) for text generated by various models $\mathcal{G}$, measured with the tester models $\mathcal{T}$ Qwen2 0.5B and Llama3.1 8B Instruct. "Original" denotes human-written text; all generated text is the original text rewritten by an LLM so that its semantic meaning is preserved.
  • Figure 3: Spearman correlation between different geometric $\mathbf{R}$ scores, measuring how similarly they rank texts generated by various models. Results are aggregated across both tester and generator models. An asterisk indicates an FDR-corrected p-value $\leq 0.05$.
  • Figure 4: Spearman correlation between geometric $\mathbf{R}$ scores and text quality metrics. Results are aggregated across both tester and generator models.
  • Figure 5: Comparison of original texts and synthetic texts generated by Qwen2.5 7B in Russian (Ru), German (De), and English (Eng), measured by Maximum Explainable Variance (\ref{eq:mev}) for various tester models $\mathcal{T}$.
  • ...and 11 more figures
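
Figures 3 and 4 report Spearman correlations between rankings. As a rough illustration of what such a comparison involves, the sketch below computes the Spearman coefficient as the Pearson correlation of ranks. The function names and the score values are hypothetical placeholders, not results from the paper, and the sketch assumes there are no tied scores.

```python
import numpy as np


def ranks(values) -> np.ndarray:
    """1-based ranks of the values; assumes no ties (a simplification)."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    result = np.empty(len(values), dtype=float)
    result[order] = np.arange(1, len(values) + 1)
    return result


def spearman(a, b) -> float:
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])


# Hypothetical metric values for four generators under two tester models.
tester_a_scores = [10.0, 20.0, 30.0, 40.0]
tester_b_scores = [1.0, 2.5, 2.7, 9.0]
print(spearman(tester_a_scores, tester_b_scores))
```

A coefficient near 1 means the two testers order the generators the same way even when their raw metric values live on different scales, which is the kind of agreement the figure captions describe. For real data with ties, a library routine such as SciPy's `scipy.stats.spearmanr` would be a safer choice.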