Bagging-Based Model Merging for Robust General Text Embeddings

Hengran Zhang, Keping Bi, Jiafeng Guo, Jiaming Zhang, Wenbo Yang, Daiting Shi, Xueqi Cheng

TL;DR

This paper analyzes how multi-task training strategies affect general-purpose text embeddings, revealing that fine-grained data interleaving via batch-level shuffling offers the strongest in-domain performance, while exposing limitations in OOD generalization and incremental learning. To address these, it introduces BOOM, a bagging-based robust model merging framework that trains multiple embeddings on sampled subsets and merges them into a single, inference-efficient model, with an extension for efficient incremental updates using lightweight retraining and merging. Empirical results across MTEB benchmarks show that BOOM improves both in-domain and out-of-domain performance over full-corpus batch-level shuffling and reduces training cost in continual learning scenarios. The work provides practical guidance for robust, scalable, and updatable general text embeddings and highlights future opportunities to tailor merging strategies to text-embedding tasks and downstream applications like retrieval-augmented generation.

Abstract

General-purpose text embedding models underpin a wide range of NLP and information retrieval applications, and are typically trained on large-scale multi-task corpora to encourage broad generalization. However, it remains unclear how different multi-task training strategies compare in practice, and how to efficiently adapt embedding models as new domains and data types continually emerge. In this work, we present a systematic study of multi-task training for text embeddings from two perspectives: data scheduling and model merging. We compare batch-level shuffling, sequential training variants, two-stage training, and multiple merging granularities, and find that simple batch-level shuffling consistently yields the strongest overall performance, suggesting that task conflicts are limited and training datasets are largely complementary. Despite its effectiveness, batch-level shuffling exhibits two practical limitations: suboptimal out-of-domain (OOD) generalization and poor suitability for incremental learning due to expensive full retraining. To address these issues, we propose Bagging-based rObust mOdel Merging (BOOM), which trains multiple embedding models on sampled subsets and merges them into a single model, improving robustness while retaining single-model inference efficiency. Moreover, BOOM naturally supports efficient incremental updates by training lightweight update models on new data with a small historical subset and merging them into the existing model. Experiments across diverse embedding benchmarks demonstrate that BOOM consistently improves both in-domain and OOD performance over full-corpus batch-level shuffling, while substantially reducing training cost in incremental learning settings.
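
The bagging-and-merging idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not state how subsets are sampled or how models are merged, so the sketch assumes sampling without replacement and a plain (optionally weighted) average of parameters. The names `sample_subset`, `train_embedding_model`, `merge_models`, `boom_train`, and `boom_update`, along with all default values, are hypothetical, and "training" is replaced by a trivial stand-in that returns the mean of its inputs.

```python
import random


def sample_subset(corpus, fraction, rng):
    """Draw a random subset of the corpus (assumption: without replacement)."""
    k = max(1, int(len(corpus) * fraction))
    return rng.sample(corpus, k)


def train_embedding_model(examples):
    """Hypothetical stand-in for fine-tuning an embedding model.

    Each example is a list of floats; the toy "model" is a parameter
    dict holding the per-dimension mean of the examples.
    """
    dim = len(examples[0])
    n = len(examples)
    return {"w": [sum(x[i] for x in examples) / n for i in range(dim)]}


def merge_models(models, weights=None):
    """Merge parameter dicts by weighted averaging (assumed merge rule)."""
    if weights is None:
        weights = [1.0 / len(models)] * len(models)
    merged = {}
    for name in models[0]:
        size = len(models[0][name])
        merged[name] = [
            sum(w * m[name][i] for w, m in zip(weights, models))
            for i in range(size)
        ]
    return merged


def boom_train(corpus, num_models=4, fraction=0.5, seed=0):
    """Train several models on sampled subsets, then merge them into one."""
    rng = random.Random(seed)
    models = [
        train_embedding_model(sample_subset(corpus, fraction, rng))
        for _ in range(num_models)
    ]
    return merge_models(models)


def boom_update(merged, new_data, history, replay_fraction=0.1,
                new_weight=0.5, seed=0):
    """Incremental update: train on new data plus a small historical
    subset, then merge the update model into the existing model."""
    rng = random.Random(seed)
    replay = sample_subset(history, replay_fraction, rng)
    update = train_embedding_model(new_data + replay)
    return merge_models([merged, update], [1.0 - new_weight, new_weight])
```

In this sketch the merged result is a single parameter set, so inference cost matches that of one model, and `boom_update` only trains on the new data and a small replay sample rather than on the full corpus. With real models, the same averaging would be applied to each tensor of the network weights.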

Paper Structure (22 sections, 5 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Average performance (%) of general text embedding models trained with different proportions of the multi-task training set on in-domain and OOD evaluation sets.
  • Figure 2: (a) Difference between average joint and individual training losses for dataset pairs; (b) hierarchical clustering results.
  • Figure 3: Average performance (%) comparison on MTEB (Eng, v2) between models trained jointly and independently on three pairs of datasets.
  • Figure 4: The overall framework of BOOM.