
Correcting Mean Bias in Text Embeddings: A Refined Renormalization with Training-Free Improvements on MMTEB

Xingyu Ren, Youran Sun, Haoyu Liang

TL;DR

Text embedding models exhibit a corpus-level mean bias $\mu$ that creates embedding space anisotropy. The authors propose Renormalization, a training-free, plug-and-play post-processing with two variants, R1 and R2, where R2 projects away the mean direction before normalization to yield superior performance. Across 38 MMTEB models, renormalization delivers substantial gains, especially in retrieval ($9.7\sigma$) and classification ($3.1\sigma$), with improvements correlating positively with $\lVert \mu \rVert$ and R2 outperforming R1. The method is lightweight, model-agnostic, and readily deployable to reduce embedding anisotropy and boost downstream tasks in real systems, notably retrieval and classification.

Abstract

We find that current text embedding models produce outputs with a consistent bias, i.e., each embedding vector $e$ can be decomposed as $\tilde{e} + \mu$, where $\mu$ is almost identical across all sentences. We propose a plug-and-play, training-free and lightweight solution called Renormalization. Through extensive experiments, we show that renormalization consistently and statistically significantly improves the performance of existing models on the Massive Multilingual Text Embedding Benchmark (MMTEB). In particular, across 38 models, renormalization improves performance by $9.7\sigma$ on retrieval tasks, $3.1\sigma$ on classification tasks, and $0.8\sigma$ on other types of tasks. Renormalization has two variants: directly subtracting $\mu$ from $e$, or subtracting the projection of $e$ onto $\mu$. We theoretically predict that the latter performs better, and our experiments confirm this prediction.
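The two variants described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes $\mu$ is estimated as the mean of a sample of embeddings, that both variants re-apply L2 normalization after the correction, and that the function names (`l2_normalize`, `estimate_mean`, `renormalize_r1`, `renormalize_r2`) and the synthetic data are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm (eps guards against zero rows)."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def estimate_mean(embeddings):
    """Assumption: mu is estimated as the mean over a sample of embeddings."""
    return embeddings.mean(axis=0)


def renormalize_r1(embeddings, mu):
    """Variant 1 (sketch): subtract mu from each embedding, then re-normalize."""
    return l2_normalize(embeddings - mu)


def renormalize_r2(embeddings, mu):
    """Variant 2 (sketch): remove each embedding's projection onto mu, then re-normalize."""
    mu_hat = mu / np.linalg.norm(mu)   # unit vector along the mean direction
    coeff = embeddings @ mu_hat        # projection coefficient per row, shape (n,)
    return l2_normalize(embeddings - np.outer(coeff, mu_hat))


def mean_cosine(x):
    """Average pairwise cosine similarity of unit-norm rows, excluding self-pairs."""
    sims = x @ x.T
    n = len(x)
    return (sims.sum() - np.trace(sims)) / (n * (n - 1))


# Synthetic stand-in for model outputs: small random signal plus a shared offset.
rng = np.random.default_rng(0)
shared_offset = np.full(8, 0.5)
raw = l2_normalize(0.1 * rng.normal(size=(100, 8)) + shared_offset)

mu = estimate_mean(raw)
r1 = renormalize_r1(raw, mu)
r2 = renormalize_r2(raw, mu)
```

On the synthetic data, `mean_cosine(raw)` is high because every vector shares the same offset, and it drops after either correction. After `renormalize_r2`, every output is orthogonal to $\mu$ by construction. In a real system, the same $\mu$ would presumably need to be applied to both queries and documents so that similarities remain comparable.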

Paper Structure

This paper contains 20 sections, 11 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Comparison of two renormalization methods across models, sorted by mean vector norm (labeled after the model name). Each violin shows the distribution of the task score difference $\Delta$ before and after renormalization. The median (red bar), quartiles, and upper and lower adjacent values are marked in the violin. The y-axis is log-scaled to better present the distribution. Models from the same company share a color; models from different companies use different colors.
  • Figure 2: Correlation between model mean vector norm and renormalization effectiveness. The y-axis is the proportion of tasks with significant improvements $>2\sigma$ for each model in Method R2. We can see that the effectiveness of renormalization has a positive correlation with $\lVert \mu \rVert$.
  • Figure 3: Correlation between model mean vector norm and renormalization effectiveness. The y-axis is the task score difference $\Delta$ on a log scale, and the error bar is the sigma calculated in Table \ref{tab:model_performance}. The fitted line and shaded band show that the two quantities are positively related.