Randomly Removing 50% of Dimensions in Text Embeddings has Minimal Impact on Retrieval and Classification Tasks

Sotaro Takeshita, Yurina Takeshita, Daniel Ruffinelli, Simone Paolo Ponzetto

TL;DR

The paper investigates why randomly removing up to 50% of embedding dimensions minimally affects downstream retrieval and classification, a pattern observed across 6 encoders and 26 tasks and related to truncations in language models. It combines empirical truncation experiments with analyses of anisotropy, dimensional collapse, outlier dimensions, and per-dimension attribution to understand the robustness. The findings reveal a substantial set of degrading dimensions that are broadly distributed, explaining why random removals cancel positive and negative contributions, with PCA offering similar benefits to random truncation. The work suggests inefficiencies in current representation spaces and points to opportunities for training objectives or architectures that reduce degrading dimensions, potentially enabling more compact representations without sacrificing performance for retrieval and classification tasks.

Abstract

In this paper, we study the surprising impact that truncating text embeddings has on downstream performance. Across 6 state-of-the-art text encoders and 26 downstream tasks, we consistently observe that randomly removing up to 50% of embedding dimensions results in only a minor drop in performance (less than 10%) in retrieval and classification tasks. Given the benefits of using smaller-sized embeddings, as well as the potential insights about text encoding, we study this phenomenon and find that, contrary to what is suggested in prior work, this is not the result of an ineffective use of representation space. Instead, we find that a large number of uniformly distributed dimensions actually cause an increase in performance when removed. This would explain why, on average, removing a large number of embedding dimensions results in a marginal drop in performance. We make similar observations when truncating the embeddings used by large language models to make next-token predictions on generative tasks, suggesting that this phenomenon is not isolated to classification or retrieval tasks.
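
As an illustration of the random truncation described in the abstract, the following is a minimal sketch, not the authors' code. The function names (`random_keep_indices`, `top1_retrieval`) are hypothetical, the embeddings are synthetic Gaussian vectors rather than encoder outputs, and cosine-similarity top-1 retrieval stands in for the paper's actual evaluation metrics. The one detail that matters is that the same randomly chosen dimensions are dropped from queries and documents.

```python
import numpy as np

def random_keep_indices(dim, remove_ratio, rng):
    """Pick which dimensions survive after randomly removing a fraction of them."""
    n_remove = int(round(dim * remove_ratio))
    n_keep = dim - n_remove
    # Sample without replacement, then sort so the kept dimensions stay in order.
    return np.sort(rng.choice(dim, size=n_keep, replace=False))

def top1_retrieval(queries, docs):
    """Index of the most cosine-similar document for each query."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return np.argmax(q @ d.T, axis=1)

# Synthetic stand-ins for encoder outputs (assumption: 768-dimensional vectors).
rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 768))
queries = docs[:10] + 0.1 * rng.normal(size=(10, 768))  # noisy copies of 10 documents

keep = random_keep_indices(docs.shape[1], remove_ratio=0.5, rng=rng)
full_hits = top1_retrieval(queries, docs)
truncated_hits = top1_retrieval(queries[:, keep], docs[:, keep])
print("agreement between full and truncated retrieval:",
      float(np.mean(full_hits == truncated_hits)))
```

Synthetic data of this kind says nothing about real encoders; the sketch only shows the mechanics of the truncation.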

Paper Structure

This paper contains 38 sections, 13 figures, 6 tables.

Figures (13)

  • Figure 1: (top) Regardless of the selection, removing 50% of embedding dimensions results in less than 5% performance drop. (bottom) This seems to be due to many dimensions that lower performance (depicted in blue).
  • Figure 2: Relative performance when (a, b) last and (c, d) random K% of dimensions are removed. Error bars in (c, d) are drawn from the results of ten different random removals. The results per dataset are shown in Fig. \ref{fig:k-truncation-last-beir-datasets}, \ref{fig:k-truncation-last-mteb-datasets}.
  • Figure 3: As a result of contrastive learning for T5, downstream task performance increases (a: Full), and the use of embedding space, measured through Uniform Loss ($\downarrow$) and IsoScore ($\uparrow$) for anisotropy (b, c: top) and Corr Mean ($\downarrow$) for dimensional collapse (d: top), also improves. However, the relative performance does not change over the course of training (a: Relative); therefore, there is no strong correlation between relative performance and representation quality measures (b, c, d: bottom).
  • Figure 4: Average performance on all datasets in the NanoBEIR and MTEB benchmarks after removing each dimension in the input embeddings. The red horizontal line indicates the performance achieved by the original embedding, and each point is the performance without the corresponding dimension. Blue points lie above the red line: performance is higher without the corresponding dimension, so that dimension negatively impacts performance.
  • Figure 5: As we remove the degrading dimensions (blue plot), the relative performance for Sentence-T5 (figure on right) improves over the original embeddings. For Contriever (figure on left), we do not see such improvements, but the decay is slower than with last-k truncation. On the other hand, when only the improving dimensions are removed (orange plot), the performance decreases rapidly for both models. Results for other models are shown in Fig. \ref{fig:truncate-dd-id-only-others}.
  • ...and 8 more figures
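
The per-dimension analysis described in the captions of Figures 4 and 5 can be sketched as a leave-one-out loop. This is a minimal sketch under stated assumptions and not the authors' implementation: the function names are hypothetical, top-1 retrieval accuracy stands in for the benchmark metrics, and the three-dimensional embeddings are a hand-built toy in which the third dimension misleads the similarity ranking. A dimension is labeled "degrading" when the score without it exceeds the score of the full embedding, and "improving" when the score without it is lower.

```python
import numpy as np

def top1_accuracy(queries, docs, gold):
    """Fraction of queries whose most cosine-similar document is the gold one."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return float(np.mean(np.argmax(q @ d.T, axis=1) == gold))

def leave_one_out_scores(queries, docs, gold):
    """Score of the full embedding, plus the score with each single dimension removed."""
    dim = queries.shape[1]
    base = top1_accuracy(queries, docs, gold)
    scores = np.empty(dim)
    for j in range(dim):
        keep = np.delete(np.arange(dim), j)  # every dimension except j
        scores[j] = top1_accuracy(queries[:, keep], docs[:, keep], gold)
    return base, scores

def split_dimensions(base, scores):
    """Degrading: score rises when removed. Improving: score falls when removed."""
    degrading = np.flatnonzero(scores > base)
    improving = np.flatnonzero(scores < base)
    return degrading, improving

# Hand-built toy data (assumption): dimension 2 pulls each query toward the wrong document.
toy_docs = np.array([[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]])
toy_queries = np.array([[1.0, 0.0, 3.0], [0.0, 1.0, -3.0]])
toy_gold = np.array([0, 1])

base, scores = leave_one_out_scores(toy_queries, toy_docs, toy_gold)
degrading, improving = split_dimensions(base, scores)
print("full score:", base, "| degrading dimensions:", degrading.tolist())
```

Removing only the indices in `degrading`, or only those in `improving`, and re-scoring would correspond to the two curves contrasted in Figure 5.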