MoIN: Mixture of Introvert Experts to Upcycle an LLM

Ajinkya Tejankar, KL Navaneet, Ujjawal Panchal, Kossar Pourahmadi, Hamed Pirsiavash

TL;DR

This work tackles upcycling pre-trained LLMs without full re-pretraining by partitioning the pre-training data into semantically coherent topics and training a lightweight adapter on each topic on top of a frozen base model. The proposed Mixture of Introvert Experts (MoIN) routes each query to a single expert, loading only the most relevant topic adapter, which enables highly parallel training and scalable inference. Two topic-modeling pipelines are explored (K-means for MoIN-5k and UMAP+HDBSCAN for MoIN-500), and experiments with TinyLlama as the base model demonstrate competitive perplexity and downstream performance compared with more extensive pre-training, while using far fewer training resources. The findings suggest that independent, parallel adapter training coupled with simple routing can achieve many of the benefits of continued pre-training at a fraction of the cost, with practical implications for multi-GPU deployments and continual learning. Future directions include adaptive adapter ranks, retrieval-augmented generation via adapters, and dynamic creation of new experts to handle evolving data domains.

Abstract

The goal of this paper is to improve (upcycle) an existing large language model without the prohibitive requirements of continued pre-training of the full model. The idea is to split the pre-training data into semantically relevant groups and train an expert on each subset. An expert takes the form of a lightweight adapter added on top of a frozen base model. During inference, an incoming query is first routed to the most relevant expert, which is then loaded onto the base model for the forward pass. Unlike typical Mixture of Experts (MoE) models, the experts in our method do not work with other experts for a single query. Hence, we dub them "introvert" experts. Freezing the base model and keeping the experts as lightweight adapters allows extreme parallelism during training and inference. All experts can be trained in parallel without any communication channels between them. Similarly, inference can be heavily parallelized by distributing experts across different GPUs and routing each request to the GPU containing its relevant expert. We implement a proof-of-concept version of this method and show the validity of our approach.
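
The single-expert routing step described above can be illustrated with a minimal sketch. It assumes that each topic is summarized by a centroid vector in some embedding space and that the query has already been embedded into the same space; the cosine-similarity metric, the function names, and the toy 3-dimensional vectors are all illustrative assumptions, not details taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def route_to_expert(query_embedding, topic_centroids):
    """Return the id of the single topic whose centroid is nearest to the query.

    Exactly one expert is chosen per query; no outputs are mixed across experts.
    """
    return max(
        topic_centroids,
        key=lambda topic_id: cosine_similarity(query_embedding, topic_centroids[topic_id]),
    )


# Hypothetical topic centroids (toy values, for illustration only).
topic_centroids = {
    "cooking": [0.9, 0.1, 0.0],
    "sports": [0.1, 0.9, 0.1],
    "programming": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of an incoming query.
query_embedding = [0.1, 0.1, 0.8]

expert_id = route_to_expert(query_embedding, topic_centroids)
print(expert_id)  # the adapter for this topic would then be loaded onto the frozen base model
```

In a multi-GPU deployment of the kind the abstract describes, the returned id could also serve as the key for selecting which GPU receives the request, since each GPU would hold a known subset of adapters.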

Paper Structure

This paper contains 11 sections, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Training and Inference with Topic-wise Experts. Expert training involves two steps: (a) clustering the training data into semantically related subsets and (b) training a topic-wise expert model on each data subset in parallel. (c) At inference, the entire query is routed to the most appropriate expert using a simple nearest-neighbor search. (d) The expert is then loaded alongside the base model and the request is forwarded through the model.
  • Figure 2: LoRA-wise perplexity. Sorted perplexity values of all our trained LoRAs. Underperforming models can be easily identified and further trained for more iterations or with more data if available without affecting the performance of the well-performing ones.
  • Figure 3: Number of training documents per topic. We report the number of documents in the training set of each topic. While there are some outliers with extremely high or low amounts of training data, most topics contain more than $5000$ documents.
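
The observation in the Figure 2 caption, that underperforming adapters can be identified from sorted perplexities and retrained independently, might be operationalized as in the sketch below. The perplexity values, adapter names, and the fixed threshold are hypothetical placeholders chosen for illustration; the paper's captions do not specify a selection rule.

```python
def flag_underperforming(perplexities, threshold):
    """Sort adapters by perplexity and flag those above a chosen threshold.

    perplexities: dict mapping adapter name -> validation perplexity.
    threshold: cutoff above which an adapter is scheduled for more training
               (an assumed rule, not one stated in the paper).
    """
    ranked = sorted(perplexities.items(), key=lambda item: item[1])
    flagged = [name for name, ppl in ranked if ppl > threshold]
    return ranked, flagged


# Hypothetical per-adapter perplexities (made-up numbers).
adapter_ppl = {"topic_0": 7.2, "topic_1": 12.9, "topic_2": 6.8, "topic_3": 8.1}

ranked, flagged = flag_underperforming(adapter_ppl, threshold=10.0)
for name, ppl in ranked:
    marker = "retrain" if name in flagged else "ok"
    print(f"{name}: {ppl:.1f} ({marker})")
```

Because each adapter is trained on its own data subset against a frozen base model, retraining the flagged adapters would leave the others unchanged.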