Norm of Mean Contextualized Embeddings Determines their Variance

Hiroaki Yamagiwa, Hidetoshi Shimodaira

TL;DR

This work investigates the distribution of contextualized embeddings in Transformer models through three statistics, $Q(X_t)$, $M(X_t)$, and $V(X_t)$, and proves the identity $Q(X_t) = M(X_t) + V(X_t)$. It introduces a sequential, memory-efficient method to compute these statistics for the token-wise sets $X_t$ and demonstrates a strong trade-off between $M(X_t)$ and $V(X_t)$ in intermediate layers, likely shaped by Layer Normalization (LN). The analysis is extended to the full embedding set $X$ with the variance decomposition $V(X) = V_W(X) + V_B(X)$ into within-cluster and between-cluster variance, which connects to growing anisotropy as depth increases. Empirically, across BERT, RoBERTa, and GPT-2 on BookCorpus-derived data, deeper layers push embeddings farther from the origin (larger $M(X)$) while reducing overall variance (smaller $V(X)$); within that variance, the between-cluster part $V_B(X)$ shrinks relative to the within-cluster part $V_W(X)$. LN placement explains the differences between Pre-LN and Post-LN architectures. Overall, the paper provides interpretable, scalable metrics for embedding distributions that illuminate how contextualized representations evolve with depth and architecture.
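
The identity can be checked in a few lines. The sketch below assumes the usual definitions, which are not spelled out in this summary: $Q$ is the mean squared norm, $M$ is the squared norm of the mean, and $V$ is the mean squared distance to the mean (consistent with the Figure 3 caption, which notes that $M$ and $V$ are squared quantities). Here $n_t = |X_t|$.

```latex
\begin{align*}
\bm{\mu}(X_t) &= \frac{1}{n_t}\sum_{\bm{x}\in X_t}\bm{x}, \qquad
Q(X_t) = \frac{1}{n_t}\sum_{\bm{x}\in X_t}\|\bm{x}\|^2, \qquad
M(X_t) = \|\bm{\mu}(X_t)\|^2, \\
V(X_t) &= \frac{1}{n_t}\sum_{\bm{x}\in X_t}\|\bm{x}-\bm{\mu}(X_t)\|^2 \\
       &= \frac{1}{n_t}\sum_{\bm{x}\in X_t}\left(\|\bm{x}\|^2
          - 2\,\bm{x}^\top\bm{\mu}(X_t) + \|\bm{\mu}(X_t)\|^2\right) \\
       &= Q(X_t) - 2\,\|\bm{\mu}(X_t)\|^2 + \|\bm{\mu}(X_t)\|^2 \\
       &= Q(X_t) - M(X_t).
\end{align*}
```

Under these assumed definitions, this is the vector form of the scalar formula $\mathrm{Var}(x) = \mathbb{E}[x^2] - (\mathbb{E}[x])^2$. If $Q(X_t)$ is nearly constant across tokens, a larger $M(X_t)$ forces a smaller $V(X_t)$.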

Abstract

Contextualized embeddings vary by context, even for the same token, and form a distribution in the embedding space. To analyze this distribution, we focus on the norm of the mean embedding and the variance of the embeddings. In this study, we first demonstrate that these values follow the well-known formula for variance in statistics and provide an efficient sequential computation method. Then, by observing embeddings from intermediate layers of several Transformer models, we found a strong trade-off relationship between the norm and the variance: as the mean embedding becomes closer to the origin, the variance increases. This trade-off is likely influenced by the layer normalization mechanism used in Transformer models. Furthermore, when the sets of token embeddings are treated as clusters, we show that the variance of the entire embedding set can theoretically be decomposed into the within-cluster variance and the between-cluster variance. We found experimentally that as the layers of Transformer models deepen, the embeddings move farther from the origin, the between-cluster variance relatively decreases, and the within-cluster variance relatively increases. These results are consistent with existing studies on the anisotropy of the embedding spaces across layers.
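
The sequential computation mentioned above can be sketched with running sums. This is a minimal illustration, not the paper's algorithm: the class name `TokenStats` and its methods are hypothetical, and it assumes $Q$ is the mean squared norm, $M$ the squared norm of the mean, and $V = Q - M$. Only a count, a sum vector, and a scalar sum of squared norms are kept per token, so no embedding has to be stored.

```python
from collections import defaultdict

import numpy as np


class TokenStats:
    """Running per-token sums (hypothetical helper, not the paper's code)."""

    def __init__(self, dim):
        self.count = defaultdict(int)                         # n_t
        self.sum_vec = defaultdict(lambda: np.zeros(dim))     # sum of x
        self.sum_sq = defaultdict(float)                      # sum of ||x||^2

    def update(self, token, embedding):
        """Fold one contextualized embedding into the sums for `token`."""
        x = np.asarray(embedding, dtype=np.float64)
        self.count[token] += 1
        self.sum_vec[token] += x
        self.sum_sq[token] += float(x @ x)

    def stats(self, token):
        """Return (Q, M, V) for `token`, using V = Q - M."""
        if token not in self.count:
            raise KeyError(token)
        n = self.count[token]
        q = self.sum_sq[token] / n      # mean squared norm
        mu = self.sum_vec[token] / n    # mean embedding
        m = float(mu @ mu)              # squared norm of the mean
        return q, m, q - m


# Usage: stream (token, embedding) pairs one at a time.
acc = TokenStats(dim=2)
for tok, emb in [("a", [1.0, 0.0]), ("a", [-1.0, 0.0]), ("b", [3.0, 4.0])]:
    acc.update(tok, emb)
print(acc.stats("a"))
```

Computing $V$ as $Q - M$ subtracts two nearly equal numbers when the variance is small, so float64 accumulators are used here; a Welford-style update would be a more numerically careful alternative.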

Paper Structure (40 sections, 44 equations, 22 figures, 3 tables, 1 algorithm)

Figures (22)

  • Figure 1: Scatter plots of PCA-transformed embeddings for the embedding sets $X_t$ of selected tokens. The origin is indicated by $\times$. Tokens distributed near the origin exhibit larger variance, whereas tokens farther from the origin exhibit smaller variance. Embeddings are colored according to token frequency $n_t$.
  • Figure 2: Scatter plots of $V(X_t)$ against $M(X_t)$ for the middle-layer embeddings of six models with regression lines, slopes, and coefficients of determination, $R^2$. A consistent trade-off between $M(X_t)$ and $V(X_t)$ is observed in the intermediate layer of each model. A summary for all the other layers can be found in Fig. \ref{fig:MV_slope_and_score}. Only tokens with $1 \leq \log_{10} n_t \leq 5$ were used for regressions to reduce the influence of extreme values.
  • Figure 3: Illustration of the token-wise embedding sets $X_t$, $t\in T$, and the entire embedding set $X$. The values $\bm{\mu}(X_t)$, $M(X_t)$, and $V(X_t)$ are computed for each $X_t$, while $\bm{\mu}(X)$, $M(X)$, and $V(X)$ are for $X$. In addition, $V(X)$ is decomposed into the within-group variance $V_W(X)$ and the between-group variance $V_B(X)$. $V_W(X)$ is the frequency-weighted mean of $V(X_t)$, while $V_B(X)$ represents the spread of $\bm{\mu}(X_t)$ around $\bm{\mu}(X)$. Although $M$ and $V$ are illustrated as a norm and a standard deviation, respectively, they are actually the squared versions as shown in (\ref{eq:MXt}) and (\ref{eq:VXt}).
  • Figure 4: For each layer across the six models, the coefficient of variation (C.V.) of $Q(X_t)$ on the left, the slope of the regression line of $V(X_t)$ on $M(X_t)$ in the middle, and the corresponding coefficient of determination $R^2$ on the right are shown. For all models, the C.V. approximately reaches its minimum in the intermediate layers. Consequently, the slope and $R^2$ approximately reach their minimum and maximum, respectively, in the intermediate layers. Only tokens with $1 \leq \log_{10} n_t \leq 5$ were used to reduce the influence of extreme values.
  • Figure 5: The ratios of $M(X)$, $V_W(X)$, and $V_B(X)$, each normalized by $Q(X)$, for each layer across the six models. As the layers deepen, the ratio of $M(X)$ tends to exceed that of $V_W(X) + V_B(X)$ $(= V(X))$. Meanwhile, the ratio of $V_W(X)$ increases relative to $V_B(X)$. Figure \ref{fig:Vw_per_V} shows detailed comparisons between $V_W(X)$ and $V_B(X)$. Further plots of the ratios of these values and those of the original values are shown in Figs. \ref{fig:MX_VwX_VbX_per_QX} and \ref{fig:QX_MX_VX}, respectively, in Appendix \ref{app:X}. Only tokens with $1 \leq \log_{10} n_t \leq 5$ were used to reduce the influence of extreme values.
  • ...and 17 more figures
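
The decomposition described in the Figure 3 caption can be checked numerically. The sketch below is an illustration under assumptions, not the paper's code: the function name `decompose_variance` is hypothetical, $V_W(X)$ is taken as the frequency-weighted mean of $V(X_t)$ as the caption states, and $V_B(X)$ is assumed to be the frequency-weighted mean of $\|\bm{\mu}(X_t) - \bm{\mu}(X)\|^2$.

```python
import numpy as np


def decompose_variance(embeddings, tokens):
    """Return (V, V_W, V_B) for embeddings grouped by token label.

    Assumed definitions (see lead-in): V_W is the frequency-weighted mean of
    the per-token variances, V_B is the frequency-weighted mean squared
    distance of the per-token means from the overall mean.
    """
    X = np.asarray(embeddings, dtype=np.float64)
    labels = np.asarray(tokens)
    n = X.shape[0]
    mu = X.mean(axis=0)                                   # overall mean
    v_total = float(((X - mu) ** 2).sum(axis=1).mean())   # V(X)

    v_within = 0.0
    v_between = 0.0
    for t in np.unique(labels):
        X_t = X[labels == t]
        weight = X_t.shape[0] / n                         # n_t / n
        mu_t = X_t.mean(axis=0)
        v_within += weight * float(((X_t - mu_t) ** 2).sum(axis=1).mean())
        v_between += weight * float(((mu_t - mu) ** 2).sum())
    return v_total, v_within, v_between


# Two tight one-dimensional clusters that sit far apart.
v, v_w, v_b = decompose_variance([[0.0], [2.0], [10.0], [12.0]],
                                 ["a", "a", "b", "b"])
print(v, v_w, v_b)
```

In this toy input most of the variance is between clusters; the layer-wise trend reported in the Figure 5 caption is the opposite shift, with $V_W(X)$ growing relative to $V_B(X)$ as layers deepen.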
