From Internal Representations to Text Quality: A Geometric Approach to LLM Evaluation
Viacheslav Yusupov, Danil Maksimov, Ameliia Alaeva, Anna Vasileva, Anna Antipina, Tatyana Zaitseva, Alina Ermilova, Evgeny Burnaev, Egor Shvetsov
TL;DR
This work demonstrates that internal geometric properties of LLM representations can serve as reliable, reference-free proxies for text quality. Evaluating six tester models across eight generators, the authors show that Intrinsic Dimensionality, Effective Rank, and Maximum Explainable Variance consistently rank generated text in the same order, and that these rankings correlate with established external metrics such as BLEURT and GPT-2 perplexity. The findings hold across English, German, and Russian text and for both autoregressive and diffusion-based models, suggesting that these geometric cues reflect intrinsic text properties rather than model idiosyncrasies. The results support lightweight, annotation-free evaluation pipelines that use internal representations to assess naturalness and quality at scale, which can speed up model development and automated benchmarking across languages.
Abstract
This paper bridges internal and external analysis approaches to large language models (LLMs) by demonstrating that geometric properties of internal model representations serve as reliable proxies for evaluating generated text quality. We validate a set of metrics, including Maximum Explainable Variance, Effective Rank, Intrinsic Dimensionality, MAUVE score, and Schatten norms, measured across different layers of LLMs, and show that Intrinsic Dimensionality and Effective Rank can serve as universal assessments of text naturalness and quality. Our key finding is that different models consistently rank text from various sources in the same order based on these geometric properties, indicating that the metrics reflect inherent text characteristics rather than model-specific artifacts. This enables reference-free text quality evaluation that requires no human-annotated datasets, offering practical advantages for automated evaluation pipelines.
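To make the spectral metrics named in the abstract more concrete, the following is a minimal sketch of how Effective Rank, a Schatten norm, and Maximum Explainable Variance could be computed from the singular values of a layer's hidden-state matrix. The definitions used here are common textbook forms (effective rank as the exponential of the entropy of normalized singular values, Schatten p-norm as the p-norm of the singular values, and Maximum Explainable Variance as the variance share of the top principal direction) and are assumptions, not necessarily the exact formulations used in the paper. The function names are hypothetical, and obtaining the singular values (for example with `numpy.linalg.svd(H, compute_uv=False)` on a centered token-by-feature matrix `H`) is left outside the sketch.

```python
import math


def effective_rank(singular_values):
    """Exponential of the Shannon entropy of the normalized singular values.

    Assumed definition; equals k when k singular values are equal and the
    rest are zero.
    """
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


def schatten_norm(singular_values, p=2.0):
    """Schatten p-norm: the ordinary p-norm applied to the singular values."""
    return sum(s ** p for s in singular_values) ** (1.0 / p)


def max_explainable_variance(singular_values):
    """Share of total variance captured by the top principal direction.

    Assumed definition: largest squared singular value over the sum of all
    squared singular values (for a centered data matrix).
    """
    squared = [s * s for s in singular_values]
    return max(squared) / sum(squared)
```

Under these assumed definitions, a flat spectrum gives a high effective rank and a low Maximum Explainable Variance, while a spectrum dominated by one direction gives the opposite, which is why the two quantities are often read together when comparing representations of different texts.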
