Measuring Intrinsic Dimension of Token Embeddings

Takuya Kataiwa, Cho Hakaze, Tetsushi Ohki

TL;DR

This paper investigates how many dimensions of a token embedding are truly necessary, by estimating the Local Intrinsic Dimension (LID) and the global Intrinsic Dimension (ID) for both traditional word embeddings and large-scale language models. The study finds that embedding spaces lie on low-dimensional manifolds (ID around $10$–$30$ for an extrinsic dimension ED = $300$), revealing substantial redundancy across scales, with redundancy stabilizing near $98\%$ in large models. ID drops rapidly in the early stages of training and then stabilizes, suggesting that a compact core representation is learned quickly. Importantly, choosing the LoRA rank for embedding layers based on the estimated ID yields efficiency gains while preserving perplexity, indicating that ID can inform practical compression strategies for NLP systems.
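
To make the redundancy figures concrete, the following worked arithmetic assumes that redundancy is defined as the fraction of extrinsic dimensions not accounted for by the intrinsic dimension. The symbol $R$ and this exact formula are assumptions made for illustration; the summary itself only reports the resulting percentages.

$$
R = 1 - \frac{\mathrm{ID}}{\mathrm{ED}}, \qquad
R\big|_{\mathrm{ID}=30,\ \mathrm{ED}=300} = 1 - \frac{30}{300} = 90\%, \qquad
R\big|_{\mathrm{ID}=10,\ \mathrm{ED}=300} = 1 - \frac{10}{300} \approx 96.7\%.
$$

Under the same assumed definition, a redundancy of $98\%$ corresponds to $\mathrm{ID} = 0.02 \cdot \mathrm{ED}$, that is, an intrinsic dimension of about $2\%$ of the embedding width.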

Abstract

In this study, we measure the Intrinsic Dimension (ID) of token embeddings to estimate the dimension of the manifolds spanned by the representations, and thereby quantify their redundancy relative to their extrinsic dimensionality. In detail, (1) we estimate the ID of token embeddings in small-scale language models and in modern large language models, finding that the embedding spaces often reside on manifolds of lower dimension than their extrinsic dimensionality; (2) we measure the ID across various model sizes and observe an increase in redundancy rates as the model scale grows; (3) we measure the dynamics of the ID during training and find a rapid drop in ID in the early stages of training. Moreover, (4) when LoRA is applied to the embedding layers, we observe a sudden drop in perplexity around the estimated IDs, suggesting that the ID can serve as a useful guideline for applying LoRA.
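
As a rough illustration of what estimating an ID involves, the sketch below implements one common estimator, the Levina-Bickel maximum-likelihood estimate of local intrinsic dimension from k-nearest-neighbor distances. It is an assumption that this resembles the estimator used in the paper; the function names `lid_mle` and `redundancy_ratio`, the choice `k=20`, the use of the mean LID as a global ID, and the synthetic data are all hypothetical choices made for this example.

```python
import numpy as np


def lid_mle(embeddings: np.ndarray, k: int = 20) -> np.ndarray:
    """Levina-Bickel MLE of local intrinsic dimension for each row.

    embeddings: array of shape (n_tokens, extrinsic_dim), with n_tokens > k.
    Returns an array of shape (n_tokens,) with one LID estimate per token.
    """
    # Pairwise Euclidean distances via the Gram-matrix identity.
    sq_norms = np.sum(embeddings ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * embeddings @ embeddings.T
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negatives from rounding
    dists = np.sqrt(sq_dists)

    # Exclude each point from its own neighbor list.
    np.fill_diagonal(dists, np.inf)

    # Distances to the k nearest neighbors, sorted ascending.
    knn = np.sort(dists, axis=1)[:, :k]

    # LID = (k - 1) / sum_{j<k} log(T_k / T_j), where T_j is the j-th NN distance.
    log_ratios = np.log(knn[:, -1:] / knn[:, :-1])
    return (k - 1) / np.sum(log_ratios, axis=1)


def redundancy_ratio(intrinsic_dim: float, extrinsic_dim: int) -> float:
    """Fraction of extrinsic dimensions not needed (assumed definition)."""
    return 1.0 - intrinsic_dim / extrinsic_dim


if __name__ == "__main__":
    # Synthetic stand-in for an embedding matrix: points on a 5-dimensional
    # linear subspace embedded in 50 dimensions (not real model weights).
    rng = np.random.default_rng(0)
    latent = rng.standard_normal((500, 5))
    embeddings = latent @ rng.standard_normal((5, 50))

    lid_values = lid_mle(embeddings, k=20)
    global_id = float(lid_values.mean())  # one simple way to aggregate LIDs
    print(f"estimated ID: {global_id:.2f}")
    print(f"redundancy:   {redundancy_ratio(global_id, embeddings.shape[1]):.2%}")
```

On real models the embedding matrix would replace the synthetic array, and a nearest-neighbor index would be preferable to the dense distance matrix used here, which needs memory quadratic in the vocabulary size.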

Paper Structure

This paper contains 15 sections, 3 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Kernel densities of LID values.
  • Figure 2: Redundancy Ratio against model parameters.
  • Figure 3: Dynamics of ID against the training steps.
  • Figure 4: Validation perplexity against LoRA inner dimensions on pythia-410m.