Composition of Experts: A Modular Compound AI System Leveraging Large Language Models

Swayambhoo Jain, Ravi Raju, Bo Li, Zoltan Csaki, Jonathan Li, Kaizhao Liang, Guoyao Feng, Urmish Thakkar, Anand Sampat, Raghu Prabhakar, Sumati Jairath

TL;DR

This paper introduces the Composition of Experts (CoE), a modular compound AI system built from multiple expert LLMs. A router dynamically selects the most appropriate expert for each input, enabling efficient use of resources and improved performance.

Abstract

Large Language Models (LLMs) have achieved remarkable advancements, but their monolithic nature presents challenges in terms of scalability, cost, and customization. This paper introduces the Composition of Experts (CoE), a modular compound AI system leveraging multiple expert LLMs. CoE uses a router to dynamically select the most appropriate expert for a given input, enabling efficient utilization of resources and improved performance. We formulate the general problem of training a CoE and discuss the inherent complexities associated with it. To address these complexities, we propose a two-step routing approach: a router first classifies the input into one of several distinct categories, and a category-to-expert mapping then selects the desired expert. CoE offers a flexible and cost-effective solution for building compound AI systems. Our empirical evaluation demonstrates the effectiveness of CoE in achieving superior performance with reduced computational overhead. Because a CoE comprises many expert LLMs, it has unique system requirements for cost-effective serving. We present an efficient implementation of CoE that leverages the unique three-tiered memory architecture of the SambaNova SN40L RDU. CoEs obtained using the open-weight LLMs Qwen/Qwen2-7B-Instruct, google/gemma-2-9b-it, google/gemma-2-27b-it, meta-llama/Llama-3.1-70B-Instruct and Qwen/Qwen2-72B-Instruct achieve a score of $59.4$ with merely $31$ billion average active parameters on Arena-Hard and a score of $9.06$ with $54$ billion average active parameters on MT-Bench.
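
The two-step routing described in the abstract can be illustrated with a minimal Python sketch. Everything beyond the expert names is an assumption made for illustration: the keyword-based `keyword_category_router` is a toy stand-in for the trained category router, the three categories and the `CATEGORY_TO_EXPERT` table are hypothetical, the parameter counts are rounded from the model names, and `average_active_params` reflects one plausible reading of "average active parameters" (the mean size of the expert selected per prompt), not necessarily the paper's exact definition.

```python
from typing import Callable, Dict, List

# Approximate sizes in billions, read off the model names (assumption).
EXPERT_PARAMS_B: Dict[str, float] = {
    "Qwen/Qwen2-7B-Instruct": 7.0,
    "google/gemma-2-9b-it": 9.0,
    "google/gemma-2-27b-it": 27.0,
    "meta-llama/Llama-3.1-70B-Instruct": 70.0,
    "Qwen/Qwen2-72B-Instruct": 72.0,
}

# Hypothetical category-to-expert mapping; the real mapping is learned.
CATEGORY_TO_EXPERT: Dict[str, str] = {
    "coding": "Qwen/Qwen2-72B-Instruct",
    "math": "meta-llama/Llama-3.1-70B-Instruct",
    "chat": "google/gemma-2-9b-it",
}


def keyword_category_router(prompt: str) -> str:
    """Toy stand-in for the category router: prompt -> category."""
    text = prompt.lower()
    if "def " in text or "code" in text:
        return "coding"
    if any(tok in text for tok in ("solve", "integral", "equation")):
        return "math"
    return "chat"


def route(
    prompt: str,
    category_router: Callable[[str], str] = keyword_category_router,
    category_to_expert: Dict[str, str] = CATEGORY_TO_EXPERT,
) -> str:
    """Two-step routing: classify the prompt, then look up the expert."""
    return category_to_expert[category_router(prompt)]


def average_active_params(prompts: List[str]) -> float:
    """Mean parameter count (billions) of the expert chosen per prompt."""
    chosen = [route(p) for p in prompts]
    return sum(EXPERT_PARAMS_B[e] for e in chosen) / len(chosen)


if __name__ == "__main__":
    demo = ["Write code to sort a list", "Solve this equation for x", "Tell me a joke"]
    for p in demo:
        print(p, "->", route(p))
    print("average active params (B):", average_active_params(demo))
```

Separating the classifier from the lookup table means the mapping can be changed (for example, to favor smaller experts under a parameter budget) without retraining the router.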

Paper Structure

This paper contains 20 sections, 7 equations, 15 figures, 2 tables.

Figures (15)

  • Figure 1: Abstract representation of Composition of Experts (CoE) for a given set of expert LLMs $\mathcal{E} = \{ \mathbf{E}_1, \cdots, \mathbf{E}_K\}$. For a given subset of experts $\mathcal{E}_s \subseteq \mathcal{E}$, the $\textrm{CoE}(\mathbf{p};\mathcal{E}_s, \textrm{R})$ routes the input prompt $\mathbf{p}$ to one of the experts in $\mathcal{E}_s$ using the routing function $\textrm{R}(\mathbf{p})$ and produces the output by loading and then running that expert.
  • Figure 2: Abstract representation of Composition of Experts (CoE) with two-step routing. For a given subset of experts $\mathcal{E}_s \subseteq \mathcal{E}$, the $\textrm{CoE}(\mathbf{p};\mathcal{E}_s, \textrm{CE}, \textrm{CR})$ routes the input prompt $\mathbf{p}$ by first mapping it to one of the $M$ categories using the category-router $\textrm{CR}(\mathbf{p})$, followed by the category-to-expert mapping $\textrm{CE}\left( \textrm{CR}(\mathbf{p})\right)$ to choose a designated expert for that category.
  • Figure 3: 2D t-SNE plot of prompt embeddings obtained from the text-embedding model intfloat/e5-mistral-7b-instruct for prompts in the CoE training data. Labels are based on the best expert LLM chosen from the expert set comprising Qwen/Qwen2-7B-Instruct, google/gemma-2-9b-it, google/gemma-2-27b-it, meta-llama/Llama-3.1-70B-Instruct and Qwen/Qwen2-72B-Instruct. The best expert is obtained using LLM-as-a-judge, with details provided in Section \ref{sec:experimental_setup}.
  • Figure 4: 2D t-SNE plot of prompt embeddings obtained from the text-embedding model intfloat/e5-mistral-7b-instruct for prompts in the CoE training data. Labels are based on categories comprising a variety of domains and languages.
  • Figure 5: A simplified sequence of operations for CoE serving via SN40L. Router weights are in HBM. Expert weights are in DDR, with a region pre-allocated in HBM for the "current" expert(s).
  • ...and 10 more figures
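
The serving sequence summarized in the Figure 5 caption (router resident in HBM, expert weights in DDR, a pre-allocated HBM region for the "current" expert(s)) can be mimicked by a small simulation. This is only a sketch under stated assumptions: the `ExpertCache` class, the least-recently-used eviction policy, and the `serve` helper are hypothetical names and choices introduced here, and no real SN40L API is used or implied.

```python
from collections import OrderedDict
from typing import Callable, List, Tuple


class ExpertCache:
    """Models the HBM region reserved for the current expert(s).

    All experts are assumed to live in DDR; `hbm_slots` of them can be
    resident in HBM at once. LRU eviction is an assumption of this sketch.
    """

    def __init__(self, hbm_slots: int = 1) -> None:
        if hbm_slots < 1:
            raise ValueError("hbm_slots must be at least 1")
        self.hbm_slots = hbm_slots
        self.resident: "OrderedDict[str, None]" = OrderedDict()
        self.ddr_to_hbm_copies = 0  # number of expert loads performed

    def ensure_loaded(self, expert: str) -> bool:
        """Make `expert` resident in HBM; return True on a cache hit."""
        if expert in self.resident:
            self.resident.move_to_end(expert)  # mark as most recently used
            return True
        if len(self.resident) >= self.hbm_slots:
            self.resident.popitem(last=False)  # evict least recently used
        self.resident[expert] = None
        self.ddr_to_hbm_copies += 1
        return False


def serve(
    prompts: List[str],
    router: Callable[[str], str],
    cache: ExpertCache,
) -> List[Tuple[str, bool]]:
    """Route each prompt, load its expert if needed, record (expert, hit)."""
    trace = []
    for prompt in prompts:
        expert = router(prompt)            # router runs from HBM
        hit = cache.ensure_loaded(expert)  # DDR -> HBM copy on a miss
        trace.append((expert, hit))        # the expert would run here
    return trace


if __name__ == "__main__":
    # Toy router: the prompt string is used directly as the expert name.
    cache = ExpertCache(hbm_slots=1)
    print(serve(["A", "A", "B", "A"], lambda p: p, cache))
    print("DDR->HBM copies:", cache.ddr_to_hbm_copies)
```

Counting DDR-to-HBM copies makes the cost of expert switching visible: consecutive prompts routed to the same expert incur no extra load, while alternating experts with a single HBM slot triggers a copy on each switch.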