Geometry-Preserving Aggregation for Mixture-of-Experts Embedding Models

Sajjad Kachuee, Mohammad Sharifkhani

TL;DR

Spherical Barycentric Aggregation (SBA) is introduced as a geometry-preserving aggregation operator that separates radial and angular components to maintain hyperspherical structure while remaining fully compatible with existing routing mechanisms.

Abstract

Mixture-of-Experts (MoE) embedding models combine expert outputs using weighted linear summation, implicitly assuming a linear subspace structure in the embedding space. This assumption is shown to be inconsistent with the geometry of expert representations. Geometric analysis of a modern MoE embedding model reveals that expert outputs lie on a shared hyperspherical manifold characterized by tightly concentrated norms and substantial angular separation. Under this geometry, linear aggregation induces inward collapse toward the manifold interior, distorting vector magnitude and direction and reducing embedding comparability. To address this inconsistency, Spherical Barycentric Aggregation (SBA) is introduced as a geometry-preserving aggregation operator that separates radial and angular components to maintain hyperspherical structure while remaining fully compatible with existing routing mechanisms. Experiments on selected tasks from the Massive Text Embedding Benchmark (MTEB), including semantic similarity, clustering, and duplicate question detection, demonstrate consistent performance improvements with identical training cost and full stability. Additional geometric analyses confirm that SBA prevents aggregation-induced collapse and preserves hyperspherical consistency, highlighting the importance of geometry-aware aggregation in MoE embedding architectures.
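
The inward collapse described in the abstract can be illustrated numerically. The sketch below is not taken from the paper: the helper name `linear_aggregate`, the two-dimensional toy vectors, and the equal routing weights are all assumptions chosen for illustration. It shows that a convex combination of two equal-norm vectors separated by a large angle has a smaller norm than either input.

```python
import numpy as np

def linear_aggregate(experts, weights):
    """Weighted linear sum of expert outputs (one expert per row).

    Hypothetical helper for illustration; mirrors the standard
    MoE combination y = sum_i w_i * e_i.
    """
    experts = np.asarray(experts, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return weights @ experts

# Toy setup (assumed): two unit-norm expert outputs 90 degrees apart,
# combined with equal routing weights that sum to 1.
e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])
weights = np.array([0.5, 0.5])

y = linear_aggregate([e1, e2], weights)

# Mean expert norm, used as the reference radius.
mean_radius = np.mean([np.linalg.norm(e1), np.linalg.norm(e2)])

# Ratio of aggregated norm to mean expert norm; values below 1
# indicate that the output has moved toward the sphere's interior.
ratio = np.linalg.norm(y) / mean_radius
print(ratio)  # about 0.707, i.e. cos(45 degrees)
```

For two unit vectors at angle θ with weights w and 1 - w, the squared norm of the linear combination is w² + (1 - w)² + 2w(1 - w)cos θ, which equals 1 only when θ = 0 or one weight is zero.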

Paper Structure (35 sections, 12 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Distribution of relative expert output norm ratios ($r_1 / r_2$) across all MoE layers and MMLU skill categories. The tight concentration around 1 demonstrates that active experts produce embeddings with nearly identical radial magnitude, indicating that expert representations lie on a shared hyperspherical manifold.
  • Figure 2: Distribution of pairwise angular distances between active expert outputs across all MoE layers and MMLU skill categories. The majority of angular separations exceed $40^\circ$, demonstrating strong directional specialization among experts despite shared radial magnitude.
  • Figure 3: Conceptual illustration of geometric collapse under linear aggregation. Two expert embeddings lie on a shared hypersphere with large angular separation. Linear weighted aggregation collapses inward, while Spherical Barycentric Aggregation (SBA) preserves hyperspherical geometry.
  • Figure 4: Integration of Spherical Barycentric Aggregation (SBA) into a standard Mixture-of-Experts layer. SBA replaces linear weighted summation with a geometry-preserving aggregation operator while leaving routing, expert computation, and training procedures unchanged.
  • Figure 5: Distribution of relative output norm ratios $\|y\|/\bar{r}$ for linearly aggregated and SBA-aggregated embeddings. Linear aggregation exhibits inward collapse toward the hypersphere center, whereas SBA preserves hyperspherical magnitude and maintains manifold consistency.
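
The captions above describe an operator that treats radial and angular components separately, but this page does not give the exact SBA formula. The following is therefore only one plausible reading, written as a minimal sketch under stated assumptions: the radius is taken as the weighted mean of the expert norms, and the direction as the renormalized weighted sum of the unit directions. The function name `sba_sketch`, the `eps` guard, and the toy inputs are hypothetical and may differ from the authors' definition.

```python
import numpy as np

def sba_sketch(experts, weights, eps=1e-12):
    """Illustrative radial/angular aggregation (NOT the paper's exact operator).

    Assumptions:
      - `experts` has one expert output per row,
      - `weights` are non-negative routing weights that sum to 1.
    """
    experts = np.asarray(experts, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Radial component: weighted mean of the expert norms.
    norms = np.linalg.norm(experts, axis=1)
    radius = weights @ norms

    # Angular component: weighted sum of unit directions,
    # projected back onto the unit sphere.
    directions = experts / np.maximum(norms, eps)[:, None]
    mean_dir = weights @ directions
    mean_dir = mean_dir / max(np.linalg.norm(mean_dir), eps)

    return radius * mean_dir

# Toy inputs (assumed): two experts with norm 2, 90 degrees apart.
experts = np.array([[2.0, 0.0], [0.0, 2.0]])
weights = np.array([0.5, 0.5])

y_linear = weights @ experts          # standard linear aggregation
y_sba = sba_sketch(experts, weights)  # sketch of geometry-preserving aggregation

mean_radius = np.linalg.norm(experts, axis=1).mean()
ratio_linear = np.linalg.norm(y_linear) / mean_radius  # below 1: inward collapse
ratio_sba = np.linalg.norm(y_sba) / mean_radius        # stays at 1 in this example
```

In this toy case both outputs point in the same direction, and only the magnitude differs. The sketch returns a near-zero vector when the weighted directions cancel exactly (for example, antipodal experts with equal weights), a degenerate case that a full implementation would need to handle explicitly.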