HDEE: Heterogeneous Domain Expert Ensemble

Oğuzhan Ersoy, Jari Kolehmainen, Gabriel Passamani Andrade

TL;DR

The paper addresses the cost and bottlenecks of centralized dense LLM training by proposing heterogeneous domain expert ensembles (ELMForests) trained via Branch-Train-Merge (BTM). It analyzes how varying model size and the number of training iterations across domains affects perplexity, comparing three configurations: $M_\text{Ho}$-$I_\text{Ho}$ (same model size and same number of training steps for every domain), $M_\text{Ho}$-$I_\text{He}$ (same model size, domain-dependent number of training steps), and $M_\text{He}$-$I_\text{Ho}$ (domain-dependent model size, same number of training steps). Across 21 data domains, some used for training and some held out for evaluation only, heterogeneous ensembles achieve the lowest perplexity in $20$, with $M_\text{Ho}$-$I_\text{He}$ often delivering the strongest results, particularly on harder domains; higher heterogeneity generally improves performance. The work shows that independently trained, parallel domain experts with heterogeneity can outperform homogeneous baselines under the same compute budget, and it motivates future exploration of additional domain-specific heterogeneities and MoE integration for inference efficiency. The approach has practical implications for scalable, cost-effective training and deployment of domain-aware language models. At inference, the next-token distribution $p(X_t|\mathbf{x}_{<t})$ is obtained by ensembling the experts weighted by a domain posterior, which mixes the domain specialists without centralized data consolidation.
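The domain-posterior ensembling mentioned above can be sketched as follows. This is a minimal illustration assuming the standard BTM-style mixture, $p(X_t|\mathbf{x}_{<t}) = \sum_j p(X_t|\mathbf{x}_{<t}, D=j)\,p(D=j|\mathbf{x}_{<t})$ with $p(D=j|\mathbf{x}_{<t}) \propto p(\mathbf{x}_{<t}|D=j)\,p(D=j)$; it is not taken from the HDEE code base. The function names `domain_posterior` and `ensemble_next_token`, the uniform default prior, and the toy numbers are all assumptions made for illustration.

```python
import math


def domain_posterior(context_logliks, prior=None):
    """Posterior over domains given the context, via Bayes' rule.

    context_logliks[j] is log p(x_<t | D=j), i.e. the log-likelihood that
    expert j assigns to the context seen so far. `prior` is p(D=j); a
    uniform prior is assumed here when none is given.
    """
    n = len(context_logliks)
    if prior is None:
        prior = [1.0 / n] * n
    scores = [ll + math.log(p) for ll, p in zip(context_logliks, prior)]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def ensemble_next_token(expert_probs, context_logliks, prior=None):
    """Mix per-expert next-token distributions with the domain posterior.

    expert_probs[j][v] is p(X_t = v | x_<t, D=j) from expert j.
    """
    posterior = domain_posterior(context_logliks, prior)
    vocab_size = len(expert_probs[0])
    return [
        sum(w * probs[v] for w, probs in zip(posterior, expert_probs))
        for v in range(vocab_size)
    ]


# Toy example: two experts, a two-token vocabulary (hypothetical numbers).
mixed = ensemble_next_token(
    expert_probs=[[0.9, 0.1], [0.2, 0.8]],
    context_logliks=[math.log(0.3), math.log(0.1)],
)
```

In this toy case the first expert explains the context three times better than the second, so under a uniform prior it receives a posterior weight of 0.75 and dominates the mixed distribution. Each expert only needs to score the context and propose a next-token distribution, so no training data has to be shared between experts.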

Abstract

Training dense LLMs requires enormous amounts of data and centralized compute, which introduces fundamental bottlenecks and ever-growing costs for large models. Several studies aim to reduce this dependency on centralization by reducing the communication overhead of training dense models. Taking this idea to its natural extreme, training embarrassingly parallelizable ensembles of small independent experts has been shown to outperform large dense models trained in traditional centralized settings. However, existing studies do not take into account underlying differences amongst data domains and treat them as monolithic, regardless of their underlying complexity, size, or distribution. In this paper, we explore the effects of introducing heterogeneity to these ensembles of domain expert models. Specifically, by allowing models within the ensemble to vary in size, as well as in the number of training steps taken, depending on the training data's domain, we study the effect heterogeneity has on these ensembles when evaluated against domains included in, and excluded from, the training set. We use the same compute budget to train heterogeneous ensembles and homogeneous baselines for comparison. We show that the heterogeneous ensembles achieve the lowest perplexity scores in $20$ out of the $21$ data domains used in the evaluation. Our code is available at https://github.com/gensyn-ai/hdee.

Paper Structure

This paper contains 12 sections, 3 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: An iteration of BTM-style domain training in HDEE. In $\texttt{M}_\texttt{Ho}$-$\texttt{I}_\texttt{Ho}$ all models are the same size and are trained for the same number of steps. In $\texttt{M}_\texttt{Ho}$-$\texttt{I}_\texttt{He}$ all models are the same size, but are trained for more or fewer steps depending on the data domain. In $\texttt{M}_\texttt{He}$-$\texttt{I}_\texttt{Ho}$ models are different sizes depending on the data domain they will specialize in, but they are all trained for the same number of steps.
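
The three configurations in the caption can be written down as plain data to make the equal-compute comparison concrete. The sketch below is illustrative only: the domain labels, parameter counts, and step counts are hypothetical and are not the paper's settings, and `ExpertPlan` and `relative_compute` are names invented here. It assumes that every expert uses the same batch size and sequence length, so that parameters times steps is a rough proxy for training compute, and it assumes that harder domains receive the larger model or the extra steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpertPlan:
    domain: str           # hypothetical difficulty label for a data domain
    params_millions: int  # hypothetical model size
    train_steps: int      # hypothetical number of training steps


def relative_compute(plans):
    """Rough compute proxy: sum of parameters x steps over all experts.

    Assumes identical batch size and sequence length for every expert.
    """
    return sum(p.params_millions * p.train_steps for p in plans)


# All numbers below are made up for illustration.
CONFIGS = {
    # Same size, same steps (homogeneous baseline).
    "M_Ho-I_Ho": [
        ExpertPlan("easy", 100, 1000),
        ExpertPlan("medium", 100, 1000),
        ExpertPlan("hard", 100, 1000),
    ],
    # Same size, steps depend on the domain.
    "M_Ho-I_He": [
        ExpertPlan("easy", 100, 500),
        ExpertPlan("medium", 100, 1000),
        ExpertPlan("hard", 100, 1500),
    ],
    # Size depends on the domain, same steps.
    "M_He-I_Ho": [
        ExpertPlan("easy", 50, 1000),
        ExpertPlan("medium", 100, 1000),
        ExpertPlan("hard", 150, 1000),
    ],
}

budgets = {name: relative_compute(plans) for name, plans in CONFIGS.items()}
for name, budget in budgets.items():
    print(f"{name}: {budget}")
```

With these made-up numbers all three configurations land on the same proxy budget, which is the condition under which the heterogeneous ensembles and the homogeneous baseline are compared.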