Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story

Vladislav Pedashenko, Laida Kushnareva, Yana Khassan Nibal, Eduard Tulchinskii, Kristian Kuznetsov, Vladislav Zharchinskii, Yury Maximov, Irina Piontkovskaya

TL;DR

This work treats intrinsic dimension (ID) as a geometry-based lens on text representations that is largely orthogonal to prediction entropy. By grounding ID in interpretable text properties via cross-encoder analysis, sparse autoencoders (SAEs), and linguistic diagnostics, the authors show that scientific prose tends to have low ID while fiction and opinionated writing exhibit higher ID. They demonstrate causal links between SAE-derived features (e.g., formal tone vs. personal/narrative signaling) and shifts in ID through steering experiments, providing actionable insights for evaluation and data construction. The study highlights domain- and style-dependent ID patterns and argues for using ID alongside entropy to better capture textual complexity and guide model assessment and training. Overall, ID emerges as a complementary diagnostic that helps distinguish representational simplicity in scientific text from the richer degrees of freedom present in narrative and opinionated writing.

Abstract

Intrinsic dimension (ID) is an important tool in modern LLM analysis, informing studies of training dynamics, scaling behavior, and dataset structure, yet its textual determinants remain underexplored. We provide the first comprehensive study grounding ID in interpretable text properties through cross-encoder analysis, linguistic features, and sparse autoencoders (SAEs). In this work, we establish three key findings. First, ID is complementary to entropy-based metrics: after controlling for length, the two are uncorrelated, with ID capturing geometric complexity orthogonal to prediction quality. Second, ID exhibits robust genre stratification: scientific prose shows low ID (~8), encyclopedic content medium ID (~9), and creative/opinion writing high ID (~10.5) across all models tested. This reveals that contemporary LLMs find scientific text "representationally simple" while fiction requires additional degrees of freedom. Third, using SAEs, we identify causal features: scientific signals (formal tone, report templates, statistics) reduce ID; humanized signals (personalization, emotion, narrative) increase it. Steering experiments confirm these effects are causal. Thus, for contemporary models, scientific writing appears comparatively "easy", whereas fiction, opinion, and affect add representational degrees of freedom. Our multi-faceted analysis provides practical guidance for the proper use of ID and the sound interpretation of ID-based results.
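
The first finding rests on comparing ID and entropy-based metrics once text length is accounted for. The abstract does not spell out the exact procedure, so the following is only a minimal sketch of one common way to do this: a partial correlation that regresses a linear length effect out of both quantities and correlates the residuals. The function names (`residualize`, `partial_corr`) and the linear form of the length adjustment are assumptions made for illustration, not the authors' method.

```python
import numpy as np


def residualize(y, x):
    """Return the residuals of y after a least-squares fit on x (with intercept)."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef


def partial_corr(a, b, control):
    """Pearson correlation of a and b after removing a linear effect of control.

    Hypothetical helper: `a` could hold per-text ID values, `b` per-text
    cross-entropy, and `control` per-text length in tokens.
    """
    res_a = residualize(a, control)
    res_b = residualize(b, control)
    return float(np.corrcoef(res_a, res_b)[0, 1])


if __name__ == "__main__":
    # Synthetic data only: both quantities depend on length, nothing else is shared.
    rng = np.random.default_rng(0)
    length = rng.uniform(50, 500, size=2000)
    fake_id = length + rng.normal(scale=20, size=2000)
    fake_ce = length + rng.normal(scale=20, size=2000)
    print("raw correlation:    ", np.corrcoef(fake_id, fake_ce)[0, 1])
    print("partial correlation:", partial_corr(fake_id, fake_ce, length))
```

On synthetic data where length is the only shared driver, the raw correlation is high and the partial correlation falls close to zero. That is the pattern the abstract describes for ID and entropy, though the paper's own data and adjustment may differ.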

Paper Structure

This paper contains 49 sections, 18 equations, 25 figures, 10 tables.

Figures (25)

  • Figure 1: Intrinsic dimension characterizes the geometry of hidden representations (blue points on the leftmost frame), while prediction-based metrics such as entropy and cross-entropy depend on the unembedding dictionary (red points on the leftmost frame). Sequences of embeddings with the same intrinsic dimension may yield very different prediction entropies, depending on how densely the unembedding vectors populate the surrounding space (i.e., on the number of close neighbors, shown by grey connections). Note the significant correlation between PHD and cross-entropy loss (center frame) and the weak correlation between PHD and cross-entropy loss normalized by text length in Gemma tokens (rightmost frame).
  • Figure 2: Correlations among various ID estimators. (G) denotes ID estimators computed on Gemma, (R) on RoBERTa, and (Q) on Qwen. Note that the PHD estimators on all three models have a correlation above $0.5$ with all other estimators, making PHD a solid compromise. See Appendix \ref{sec:id_scatterplots} for scatterplots and further discussion.
  • Figure 3: PHD and gzip
  • Figure 4: Top-10 TAACO features with the strongest correlation with PHD(Gemma). See Appendix \ref{sec:id_taaco} for similar bar plots with MLE, TLE, and TwoNN.
  • Figure 5: PHD(Gemma) by source with group differentiation.
  • ...and 20 more figures
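
The captions above compare several ID estimators (PHD, MLE, TLE, TwoNN) without showing how one is computed. As a rough illustration only, here is a minimal sketch of the maximum-likelihood form of TwoNN, which estimates the dimension from the ratio of each point's second to first nearest-neighbor distance. It is not the PHD estimator the paper relies on most. The function name `twonn_id` is hypothetical, and treating the rows of `points` as the token embeddings of a single text is an assumption.

```python
import numpy as np


def twonn_id(points: np.ndarray) -> float:
    """Estimate intrinsic dimension as d = N / sum(log(r2 / r1)).

    r1 and r2 are each point's distances to its first and second nearest
    neighbors. Points whose nearest neighbor is an exact duplicate (r1 == 0)
    are dropped to avoid division by zero.
    """
    points = np.asarray(points, dtype=float)
    # Dense pairwise Euclidean distances; fine for a few hundred points.
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # ignore self-distances
    # After partitioning at index 1, columns 0 and 1 hold the two smallest values in order.
    nearest = np.partition(dists, 1, axis=1)[:, :2]
    r1, r2 = nearest[:, 0], nearest[:, 1]
    valid = r1 > 0
    log_mu = np.log(r2[valid] / r1[valid])
    return float(valid.sum() / log_mu.sum())


if __name__ == "__main__":
    # Synthetic check: a 2-D plane linearly embedded in 10-D ambient space.
    rng = np.random.default_rng(0)
    plane = rng.uniform(size=(500, 2)) @ rng.normal(size=(2, 10))
    print("estimated ID:", twonn_id(plane))
```

On the synthetic plane the estimate should land near 2 even though the ambient space has 10 dimensions, which is the sense in which ID measures the geometry of a point cloud rather than the size of the embedding space.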