Uncertainty-driven Embedding Convolution

Sungjun Lim, Kangjun Noh, Youngjun Choi, Heeyoung Lee, Kyungwoo Song

TL;DR

Uncertainty-driven Embedding Convolution (UEC) tackles the lack of a universally dominant embedding by forming a principled, uncertainty-aware ensemble. It post-hoc converts deterministic embeddings into Gaussian representations via Laplace approximation, then combines them with query-specific coefficients that down-weight uncertain models, and finally scores similarity using an uncertainty-aware, near-distributional distance surrogate. Across multilingual benchmarks (MIRACL/MMTEB), UEC consistently improves retrieval, classification, and semantic similarity while providing well-calibrated uncertainty estimates and maintaining near-linear computational complexity. The approach delivers robust, adaptable embedding ensembles suitable for real-time, cross-domain NLP tasks and highlights future directions in extending uncertainty modeling to aleatoric, multimodal, and fairness-aware contexts.

Abstract

Text embeddings are essential components in modern NLP pipelines. Although numerous embedding models have been proposed, no single model consistently dominates across domains and tasks. This variability motivates the use of ensemble techniques to combine complementary strengths. However, most existing ensemble methods operate on deterministic embeddings and fail to account for model-specific uncertainty, limiting their robustness and reliability in downstream applications. To address these limitations, we propose Uncertainty-driven Embedding Convolution (UEC). UEC first transforms deterministic embeddings into probabilistic ones in a post-hoc manner. It then computes adaptive ensemble coefficients based on embedding uncertainty, derived from a principled surrogate-loss formulation. Additionally, UEC employs an uncertainty-aware similarity function that directly incorporates uncertainty into the similarity scoring, providing a theoretically grounded and efficient surrogate to distributional distances. Extensive experiments on diverse benchmarks demonstrate that UEC consistently improves both performance and robustness by leveraging principled uncertainty modeling.
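The ensembling step described above can be illustrated with a minimal sketch. It assumes that every encoder has already been converted to a diagonal Gaussian (a mean vector and a per-dimension variance vector) and that all encoders share the same dimensionality. The inverse-variance weights used here are a standard stand-in chosen for illustration; they are not the coefficients UEC derives from its surrogate loss, which this summary does not give. The function name `precision_weighted_ensemble` and the `eps` parameter are hypothetical.

```python
def precision_weighted_ensemble(means, variances, eps=1e-12):
    """Combine K diagonal-Gaussian embeddings dimension by dimension.

    means, variances: lists of K vectors (lists of floats) of equal length.
    Weights are proportional to 1 / variance (an illustrative assumption,
    not the paper's coefficient formula), so uncertain models count less.
    Returns the mean and variance of the weighted sum, treating the
    K Gaussians as independent.
    """
    dim = len(means[0])
    ens_mean, ens_var = [], []
    for d in range(dim):
        # Precision (inverse variance) of each model in this dimension.
        precisions = [1.0 / (v[d] + eps) for v in variances]
        total = sum(precisions)
        weights = [p / total for p in precisions]  # sums to 1
        # A weighted sum of independent Gaussians is Gaussian with
        # mean sum_k w_k * mu_k and variance sum_k w_k^2 * var_k.
        ens_mean.append(sum(w * m[d] for w, m in zip(weights, means)))
        ens_var.append(sum(w * w * v[d] for w, v in zip(weights, variances)))
    return ens_mean, ens_var


# Two hypothetical encoders: the first is confident, the second is not.
means = [[1.0, 0.0], [0.0, 1.0]]
variances = [[0.01, 0.01], [1.0, 1.0]]
mu, var = precision_weighted_ensemble(means, variances)
```

In this toy input the confident encoder receives a weight of about 100/101 in each dimension, so the ensembled mean stays close to its embedding, whereas a uniform average would land halfway between the two.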

Paper Structure

This paper contains 107 sections, 2 theorems, 90 equations, 7 figures, 19 tables.

Key Result

Theorem 1

Let $\mathbf{q}\sim\mathcal{N}(\boldsymbol{\mu}_\mathbf{q},\boldsymbol{\Sigma}_\mathbf{q})$ and $\mathbf{c}\sim\mathcal{N}(\boldsymbol{\mu}_\mathbf{c},\boldsymbol{\Sigma}_\mathbf{c})$ be two Gaussian embeddings with $\ell_2$-normalized mean vectors and diagonal covariances [...] Hence, ranking by $\hat{s}$ induces the same ranking as minimizing $W_2^2$, up to $O(\varepsilon^2)$ [...]

Figures (7)

  • Figure 1: Comparison of deterministic embedding ensemble and UEC. Deterministic ensemble (left) uniformly averages embeddings without considering their reliability, often leading to suboptimal decisions. In this example, both candidate embeddings contribute equally, resulting in an incorrect retrieval. In contrast, the proposed UEC (right) adjusts weights based on the uncertainty, assigning higher importance to the more reliable candidate and successfully retrieving the correct answer.
  • Figure 2: Overview of the UEC framework: UEC first transforms deterministic embeddings from multiple encoder models into probabilistic representations using Laplace approximation. These probabilistic embeddings are then adaptively combined by computing uncertainty-driven ensemble coefficients based on per-dimension variances. Finally, similarity is measured using an uncertainty-aware metric that accounts for both the mean and uncertainty of the ensembled embedding.
  • Figure 3: Performance on the MIRACL subset across ensemble methods. The oracle represents an upper bound obtained by selecting the best model for each language. UEC achieves performance comparable to the oracle and even surpasses it in some cases, with particularly strong gains in AUC@10.
  • Figure 4: Heatmap of model-wise coefficients assigned by UEC per language. Each row corresponds to a language-specific input, and each column to an ensemble coefficient. UEC computes ensemble coefficients that are adaptively modulated by the uncertainty of each embedding.
  • Figure 5: Challenging case where only UEC correctly retrieves the positive passage, demonstrating its ability to leverage uncertainty.
  • ...and 2 more figures

Theorems & Definitions (2)

  • Theorem 1: Bounded approximation to the squared 2-Wasserstein distance
  • Theorem 2: Affine approximation to Jeffreys divergence
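For reference, the two distributional quantities named in these theorems have well-known closed forms when both Gaussians have diagonal covariances. The sketch below computes those exact quantities only; it does not reproduce the surrogate $\hat{s}$, whose definition is not given in this summary. The function names are hypothetical, and the Jeffreys divergence is taken as the sum of the two KL directions (some texts use half of that).

```python
import math


def w2_squared_diag(mu_q, var_q, mu_c, var_c):
    """Squared 2-Wasserstein distance between two diagonal Gaussians.

    W_2^2 = ||mu_q - mu_c||^2 + sum_d (sigma_q[d] - sigma_c[d])^2,
    where sigma is the per-dimension standard deviation.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_q, mu_c))
    std_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2
                   for a, b in zip(var_q, var_c))
    return mean_term + std_term


def jeffreys_diag(mu_q, var_q, mu_c, var_c):
    """Jeffreys divergence KL(q||c) + KL(c||q) for diagonal Gaussians.

    The log-determinant terms of the two KL directions cancel, leaving
    a variance-ratio term and a precision-weighted mean-difference term.
    """
    total = 0.0
    for mq, vq, mc, vc in zip(mu_q, var_q, mu_c, var_c):
        total += 0.5 * (vq / vc + vc / vq - 2.0)
        total += 0.5 * (mq - mc) ** 2 * (1.0 / vq + 1.0 / vc)
    return total


# Toy query and candidate with unit-norm means (illustrative values).
mu_q, var_q = [1.0, 0.0], [0.04, 0.04]
mu_c, var_c = [0.0, 1.0], [0.09, 0.09]
d_w2 = w2_squared_diag(mu_q, var_q, mu_c, var_c)
d_j = jeffreys_diag(mu_q, var_q, mu_c, var_c)
```

With $\ell_2$-normalized means, the mean term of $W_2^2$ equals $2 - 2\,\boldsymbol{\mu}_\mathbf{q}^\top\boldsymbol{\mu}_\mathbf{c}$, so it is a decreasing function of cosine similarity; the remaining term depends only on the variances.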