Training Sparse Mixture Of Experts Text Embedding Models

Zach Nussbaum, Brandon Duderstadt

TL;DR

The paper tackles deployment inefficiencies in dense text embedding models by applying sparse Mixture of Experts (MoE) to reduce active parameters without sacrificing performance. It introduces Nomic Embed v2, a general-purpose MoE text embedding model that extends context length and leverages weakly supervised contrastive pretraining, consistency filtering, and hard negative mining, plus Matryoshka representations for compact embeddings. Across monolingual and multilingual benchmarks, Nomic Embed v2 matches or outperforms similarly sized models and competes with larger models, while maintaining a significantly smaller active parameter footprint. The work provides open-source code, models, and evaluation data, offering a practical pathway to efficient, scalable retrieval systems in multilingual settings.
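The idea behind Matryoshka representations can be sketched briefly: a model trained this way packs the most useful information into the leading dimensions, so an embedding can be shortened by keeping a prefix and renormalizing. The snippet below is a minimal illustration under that assumption; `truncate_embedding` and `cosine` are hypothetical helper names, the vectors are toy values, and none of this is taken from the authors' code.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` components of `vec` and L2-renormalize them."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to normalize in an all-zero prefix
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


# Toy 4-dimensional "embeddings" (illustrative values only).
query = [0.5, 0.5, 0.5, 0.5]
doc = [0.6, 0.4, 0.1, 0.7]

# Shorten both to 2 dimensions before comparing them.
query_small = truncate_embedding(query, 2)
doc_small = truncate_embedding(doc, 2)
similarity_small = cosine(query_small, doc_small)
```

Shorter vectors reduce index size and similarity cost; how much retrieval quality is retained at each length depends on how the model was trained.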

Abstract

Transformer-based text embedding models have improved their performance on benchmarks like MIRACL and BEIR by increasing their parameter counts. However, this scaling approach introduces significant deployment challenges, including increased inference latency and memory usage. These challenges are particularly severe in retrieval-augmented generation (RAG) applications, where large models' increased memory requirements constrain dataset ingestion capacity, and their higher latency directly impacts query-time performance. While causal language models have addressed similar efficiency challenges using Mixture of Experts (MoE) architectures, this approach hasn't been successfully adapted to the general text embedding setting. In this paper, we introduce Nomic Embed v2, the first general purpose MoE text embedding model. Our model outperforms models in the same parameter class on both monolingual and multilingual benchmarks while also maintaining competitive performance with models twice its size. We open-source all code, models, and evaluation data to ensure full reproducibility of our training pipeline at https://github.com/nomic-ai/contrastors.
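To make the "active parameters" idea concrete, here is a minimal sketch of sparse top-k expert routing, the general mechanism that lets an MoE layer hold many experts while running only a few per token. It is an illustration only: the function names, the four toy experts, the choice of k = 2, and the renormalization of the selected router weights are all assumptions, not details confirmed by the abstract.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are renormalized over the chosen experts (an assumption; some
    MoE variants use the raw softmax probabilities instead).
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weight_sum = sum(probs[i] for i in chosen)
    return [(i, probs[i] / weight_sum) for i in chosen]


def moe_layer(token, experts, router_logits, k=2):
    """Combine the outputs of only the k selected experts for one token."""
    out = [0.0] * len(token)
    for idx, weight in top_k_route(router_logits, k):
        expert_out = experts[idx](token)  # unselected experts are never run
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out


# Four toy "experts" that just scale their input (hypothetical stand-ins
# for feed-forward blocks).
experts = [lambda v, s=s: [s * x for x in v] for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 1.5]  # router scores for one token

routes = top_k_route(logits, 2)
output = moe_layer([1.0, 1.0], experts, logits, k=2)
```

With four experts and k = 2, only half of the expert parameters take part in any single token's forward pass, even though all of them must be stored.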

Paper Structure

This paper contains 35 sections, 4 equations, 1 figure, 12 tables.

Figures (1)

  • Figure 1: Impact of Model Size and Batch Size on Retrieval Performance. NDCG@10 scores on BEIR benchmark across different batch sizes and model architectures. The upcycled MoE model's performance approaches that of a model with 3x more active parameters as batch size increases, demonstrating efficient scaling behavior.
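For readers unfamiliar with the metric in Figure 1, the following sketch shows one common way to compute NDCG@10 for a single query from graded relevance labels. It assumes the linear-gain form of DCG and, by default, that every relevant document appears in the ranked list; `dcg_at_k` and `ndcg_at_k` are illustrative names, and official benchmark tooling may differ in details.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked relevance labels."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(ranked_relevances, k=10, all_relevances=None):
    """NDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    `all_relevances` should hold the labels of every judged document for the
    query; if omitted, the ranked list itself is used (an assumption that no
    relevant document was missed).
    """
    pool = ranked_relevances if all_relevances is None else all_relevances
    ideal = sorted(pool, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0.0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Relevance labels in the order the system returned the documents.
perfect = ndcg_at_k([1, 1, 0, 0])
swapped = ndcg_at_k([0, 1])
```

A benchmark-level score is typically the mean of this per-query value across all queries in a dataset.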