DEPT: Decoupled Embeddings for Pre-training Language Models

Alex Iacob, Lorenzo Sani, Meghdad Kurmanji, William F. Shen, Xinchi Qiu, Dongqi Cai, Yan Gao, Nicholas D. Lane

TL;DR

This work tackles pre-training of language models under highly heterogeneous data mixtures by decoupling embeddings from the transformer body, enabling vocabulary-specific representations without a shared embedding space. It introduces DEPT with three variants (GLOB, TRIM, and SPEC) that trade off embedding sharing against memory and communication efficiency, allowing vocabulary-agnostic federated pre-training. Empirical results show that DEPT improves transformer-body generalization, training efficiency, and plasticity while substantially reducing embedding memory and inter-device communication and enabling per-source vocabularies. The framework supports multi-domain and multilingual settings, yields downstream task gains, and offers a scalable path toward vocabulary-agnostic federated pre-training of billion-scale models, although SPEC requires an additional global embedding for inference. Overall, DEPT presents a principled approach to mitigating vocabulary dilution and negative interference, with practical benefits for cross-domain and cross-language model pre-training.

Abstract

Language Model pre-training uses broad data mixtures to enhance performance across domains and languages. However, training on such heterogeneous text corpora requires extensive and expensive efforts. Since these data sources vary significantly in lexical, syntactic, and semantic aspects, they cause negative interference or the "curse of multilinguality". To address these challenges we propose a communication-efficient pre-training framework, DEPT. Our method decouples embeddings from the transformer body while simultaneously training the latter on multiple data sources without requiring a shared vocabulary. DEPT can: (1) train robustly and effectively under significant data heterogeneity, (2) minimize token embedding parameters to only what the data source vocabulary requires, while cutting communication costs in direct proportion to both the communication frequency and the reduction in parameters, (3) enhance transformer body plasticity and generalization, improving both average perplexity (up to 20%) and downstream task performance, and (4) enable training with custom optimized vocabularies per data source. We demonstrate DEPT's potential via the first vocabulary-agnostic federated pre-training of billion-scale models, reducing communication costs by orders of magnitude and embedding memory by 4-5x.
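
As a rough illustration of claim (2), the sketch below counts how many parameters are exchanged per training step when workers synchronize only once every `local_steps` steps and send some, or none, of the embedding matrix. The function names and every number in the usage lines (body size, vocabulary size, model dimension, synchronization interval) are hypothetical values chosen for the arithmetic, not figures from the paper.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Parameters in one token embedding matrix of shape (vocab_size, d_model)."""
    return vocab_size * d_model


def comm_params_per_step(body_params: int, emb_params_sent: int, local_steps: int) -> float:
    """Average number of parameters communicated per local training step.

    Synchronizing once every `local_steps` steps divides the per-step cost by
    that factor; sending fewer embedding parameters shrinks the numerator.
    """
    if local_steps < 1:
        raise ValueError("local_steps must be at least 1")
    return (body_params + emb_params_sent) / local_steps


# Hypothetical sizes, for illustration only.
body = 100_000_000
emb = embedding_params(vocab_size=250_000, d_model=1024)

# Baseline: synchronize everything after every step.
baseline = comm_params_per_step(body, emb, local_steps=1)

# Sketch of a decoupled setup: embeddings stay local, body synced every 500 steps.
decoupled = comm_params_per_step(body, 0, local_steps=500)

reduction = baseline / decoupled
```

With these made-up sizes, the reduction comes from two independent factors: the synchronization interval and the share of parameters that no longer needs to be sent.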

Paper Structure (45 sections, 1 equation, 7 figures, 21 tables, 1 algorithm)

Figures (7)

  • Figure 1: Pipeline for DEPT variants: TRIM (top-right), GLOB (bottom-left), SPEC (bottom-right), with the STANDARD approach (top-left). The numbered pipeline steps proceed as follows: (1) text corpora are processed into a vocabulary and tokenizer (global for STANDARD, GLOB, and TRIM; global or personalized for SPEC); (2) corpora are tokenized into a pre-tokenized dataset; (3) WORKERS train the model on their pre-tokenized data; (4) partial training results are collected; (5) results are aggregated; (6) the new model is sent to WORKERS. Steps 3–6 repeat to convergence.
  • Figure 2: Activations and model norms of STANDARD (STD) training versus DEPT (avg $\pm$ min/max) for a $350$M model trained with identical local hyperparameters, prior to adjusting STD ($\tau=0$) and STD ($\tau=1$) (uniform and proportional sampling) to a lower learning rate. The OuterOpt of DEPT introduces regularization effects due to noise injection [DontUseLargeBatchesUseLocalSGD] and meta-learning characteristics [REPTILE], which constrain these sources of model divergence [meta_opt].
  • Figure 3: Adaptation curves starting from a randomly initialized matrix. DEPT variants are always stable in their convergence, reaching the lowest perplexity for the full dataset and the out-of-distribution language (HI). They are also always the fastest to adapt. Full results are available in \ref{fig:fed:mc4:125M:balanced:perplexity_ratio_full}.
  • Figure 4: Convergence plot of our $1.3$ billion parameter model trained in a vocabulary-agnostic federated fashion. For the initial rounds, we sample $4$ data sources out of $8$; after seeing most of the clients, we reduce the number to $2$. We deliberately introduce EN only later in the experiment.
  • Figure 5: Adaptation curves starting from a randomly initialized matrix. DEPT is always stable in its convergence, reaching the lowest perplexity for the pre-training distribution (MC4-FULL), for the lowest-resource languages in the distribution (SW), and for the two out-of-distribution languages (HI, DE). It is also always the fastest to adapt.
  • ...and 2 more figures
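
Steps 3 to 6 of the pipeline described in the Figure 1 caption can be sketched as a small training loop. This is a minimal toy under stated assumptions, not the paper's implementation: local training is replaced by gradient steps on a quadratic loss, aggregation is plain averaging (the caption does not specify the aggregation rule), and each worker keeps its own embedding matrix locally to mimic a per-source vocabulary. All names (`local_train`, `run_round`, `D_MODEL`, the worker dictionaries) are hypothetical.

```python
import numpy as np

D_MODEL = 4  # toy hidden size (hypothetical)


def local_train(body, embedding, target, steps=5, lr=0.1):
    """Stand-in for step 3: a few gradient steps on toy quadratic losses.

    A real worker would train a language model on its pre-tokenized data.
    """
    body = body.copy()
    embedding = embedding.copy()
    for _ in range(steps):
        body -= lr * (body - target)   # gradient of 0.5 * ||body - target||^2
        embedding -= lr * embedding    # gradient of 0.5 * ||embedding||^2
    return body, embedding


def run_round(global_body, workers):
    """One pass over steps 3 to 6 for the shared body parameters."""
    collected = []
    for w in workers:
        # Step 3: local training; the embedding never leaves the worker.
        body, w["embedding"] = local_train(global_body, w["embedding"], w["target"])
        # Step 4: collect the partial result (body only).
        collected.append(body)
    # Step 5: aggregate by plain averaging (an assumption of this sketch).
    # Step 6: the returned body is what gets sent back to the workers.
    return np.mean(collected, axis=0)


# Two workers with different vocabulary sizes (10 and 7 rows) and different optima.
workers = [
    {"embedding": np.ones((10, D_MODEL)), "target": np.full(D_MODEL, 1.0)},
    {"embedding": np.ones((7, D_MODEL)), "target": np.full(D_MODEL, 3.0)},
]

body = np.zeros(D_MODEL)
for _ in range(50):  # steps 3 to 6 repeat until convergence
    body = run_round(body, workers)
```

In this toy the shared body settles midway between the two workers' optima, while the embedding matrices keep their worker-specific shapes because they are never aggregated.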