The Shape of Learning: Anisotropy and Intrinsic Dimensions in Transformer-Based Models

Anton Razzhigaev, Matvey Mikhalchuk, Elizaveta Goncharova, Ivan Oseledets, Denis Dimitrov, Andrey Kuznetsov

TL;DR

The paper investigates anisotropy and intrinsic dimensionality in transformer embeddings. It finds a bell-shaped anisotropy profile across decoder layers, in contrast to the more uniform profile seen in encoders. It also shows that anisotropy grows during training in decoders and that decoder embeddings follow a two-phase intrinsic-dimension trajectory, expanding early in training and compressing late. The methodology combines SVD-based anisotropy, a distance-ratio intrinsic-dimension estimator, and cross-method validation across multiple models and training stages. These findings highlight differences between encoder and decoder representations that may inform interpretability work and architecture-aware training strategies.
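
To make the "SVD-based anisotropy" idea concrete, the sketch below computes one common variant: the share of total squared singular-value mass carried by the top singular value of a centered embedding matrix. The function name `anisotropy`, the centering step, and this exact ratio are assumptions made for illustration; the summary above does not spell out the paper's precise formula.

```python
import numpy as np


def anisotropy(embeddings: np.ndarray) -> float:
    """Fraction of squared singular-value mass in the top direction.

    `embeddings` has shape (n_tokens, hidden_dim). A value near 1 means
    almost all variance lies along a single direction; for isotropic data
    the value is close to 1 / hidden_dim.
    """
    x = np.asarray(embeddings, dtype=np.float64)
    # Assumption: center the embeddings before the decomposition.
    x = x - x.mean(axis=0, keepdims=True)
    # Singular values are returned in descending order.
    s = np.linalg.svd(x, compute_uv=False)
    total = float(np.sum(s ** 2))
    if total == 0.0:
        # All rows identical: no variance, so report zero by convention.
        return 0.0
    return float(s[0] ** 2 / total)
```

Calling this once per layer on that layer's token embeddings gives a per-layer profile of the kind the TL;DR describes as bell-shaped for decoders.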

Abstract

In this study, we present an investigation into the anisotropy dynamics and intrinsic dimension of embeddings in transformer architectures, focusing on the dichotomy between encoders and decoders. Our findings reveal that the anisotropy profile in transformer decoders exhibits a distinct bell-shaped curve, with the highest anisotropy concentrations in the middle layers. This pattern diverges from the more uniformly distributed anisotropy observed in encoders. In addition, we found that the intrinsic dimension of embeddings increases in the initial phases of training, indicating an expansion into higher-dimensional space. This is followed by a compression phase towards the end of training, in which dimensionality decreases, suggesting a refinement into more compact representations. Our results provide fresh insights into the embedding properties of encoders and decoders.
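
As a rough illustration of how an intrinsic dimension can be estimated from distance ratios, the sketch below uses the maximum-likelihood form of a two-nearest-neighbour estimator: for each point it takes the ratio of the second to the first nearest-neighbour distance and fits a Pareto exponent to those ratios. Treat it as an assumption that this matches the paper's estimator; the function name `twonn_intrinsic_dimension` is hypothetical, and the brute-force distance matrix is only suitable for a few thousand points.

```python
import numpy as np


def twonn_intrinsic_dimension(points: np.ndarray) -> float:
    """Estimate intrinsic dimension from nearest-neighbour distance ratios."""
    x = np.asarray(points, dtype=np.float64)
    if x.ndim != 2 or x.shape[0] < 3:
        raise ValueError("need a 2D array with at least 3 points")

    # Pairwise squared Euclidean distances via the Gram matrix.
    sq_norms = np.sum(x ** 2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    np.maximum(d2, 0.0, out=d2)   # clip small negative rounding errors
    np.fill_diagonal(d2, np.inf)  # exclude each point's distance to itself

    # After partitioning at index 1, columns 0 and 1 hold the smallest and
    # second-smallest squared distance in each row.
    nearest = np.sqrt(np.partition(d2, 1, axis=1)[:, :2])
    r1, r2 = nearest[:, 0], nearest[:, 1]

    # Skip exact duplicates, whose ratio r2 / r1 is undefined.
    valid = r1 > 0.0
    log_mu = np.log(r2[valid] / r1[valid])
    log_sum = float(np.sum(log_mu))
    if log_sum <= 0.0:
        raise ValueError("degenerate neighbour distances")

    # Maximum-likelihood estimate of the Pareto exponent of the ratios.
    return float(log_mu.size / log_sum)
```

Applying such an estimator to embeddings from checkpoints saved at different pretraining steps yields a dimension-versus-step curve, which is the kind of trajectory the abstract describes as rising early and falling late.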

Paper Structure (19 sections, 7 equations, 5 figures, 1 table)

Figures (5)

  • Figure 1: Different anisotropy profiles for transformer-based encoders and decoders.
  • Figure 2: Anisotropy profile for Bloom-3B at different numbers of pretraining steps.
  • Figure 3: Anisotropy profile for Pythia-2.8B at different numbers of pretraining steps.
  • Figure 4: Intrinsic dimension averaged across layers at different pretraining steps.
  • Figure 5: Intrinsic dimension (ID) averaged across layers at different pretraining steps, estimated via three different algorithms.