CCoE: A Compact and Efficient LLM Framework with Multi-Expert Collaboration for Resource-Limited Settings

Shaomang Huang, Jianfeng Pan, Min Peng, Hanzhong Zheng

TL;DR

CCoE addresses the challenge of deploying multiple domain-specific LLMs under resource constraints by unifying independently trained domain experts as subnetworks within a shared backbone. It formalizes a partitioned architecture in which the $L$ layers are split between the backbone and expert $i$, $L = l_b + l_i$, and the output for a query $q$ follows $O_i = E_{i,l_i}(E_{b,l_b}(q)) \cdot \mathbf{1}_{\mathcal{R}(q)=i}$, where $\mathcal{R}(q)$ is the expert selected by the router. Training is guided by per-expert losses $\mathcal{L}_{i} = - \sum_{t=1}^T \log p(y_t|y_{<t}; \theta^*_b; \theta_i)$. Two routing schemes, rule-based gating and expert planning, enable flexible, scalable collaboration among experts; the planning component uses scores $h^{(t)}_{j,i}$ to select the best-matched experts. Across Math, Code, Law, Medical, and Text-to-SQL, CCoE achieves performance comparable to domain-specific LLMs while reducing memory usage by 61.3% relative to multi-domain model ensembles and improving inference efficiency by 0.76x over parameter-efficient multi-expert integration approaches, making it well suited for resource-limited deployments. The framework supports rapid expansion and knowledge updates through push-and-pop operations on experts, which supports its practical applicability to real-world, cross-domain reasoning tasks.
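The routed output $O_i$ and the per-expert loss $\mathcal{L}_i$ can be illustrated with a small toy sketch. This is not the authors' implementation: the names `ToyCCoE`, `scale_layer`, `run_layers`, and `expert_nll` are hypothetical, the "layers" are simple scaling functions standing in for transformer blocks, and the router is an arbitrary callable. The sketch also follows the simplified form in which expert layers sit after the backbone layers, whereas the paper notes that expert layers can be distributed more flexibly within the shared LLM.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def scale_layer(factor: float) -> Layer:
    """Hypothetical stand-in for a transformer layer: scales every entry."""
    return lambda h: [factor * x for x in h]


def run_layers(layers: Sequence[Layer], h: Vector) -> Vector:
    """Apply the given layers in order to the hidden state h."""
    for layer in layers:
        h = layer(h)
    return h


class ToyCCoE:
    """Toy model: one shared backbone E_b plus named expert stacks E_i."""

    def __init__(self, backbone: Sequence[Layer]) -> None:
        self.backbone = list(backbone)            # the l_b shared layers
        self.experts: Dict[str, List[Layer]] = {}  # name -> its l_i layers

    def push(self, name: str, layers: Sequence[Layer]) -> None:
        """Add (or replace) an expert without touching the others."""
        self.experts[name] = list(layers)

    def pop(self, name: str) -> List[Layer]:
        """Remove an expert and return its layers."""
        return self.experts.pop(name)

    def forward(self, q: Vector, router: Callable[[Vector], str]) -> Dict[str, Vector]:
        """Compute O_i = E_i(E_b(q)) * 1[R(q) = i] for every expert i."""
        shared = run_layers(self.backbone, q)  # E_b(q), computed once
        chosen = router(q)                     # R(q)
        outputs: Dict[str, Vector] = {}
        for name, layers in self.experts.items():
            if name == chosen:
                outputs[name] = run_layers(layers, shared)
            else:
                # Indicator is zero: the unselected expert is never executed.
                outputs[name] = [0.0] * len(shared)
        return outputs


def expert_nll(target_token_probs: Sequence[float]) -> float:
    """L_i = -sum_t log p(y_t | y_<t), given the probability of each target token."""
    return -sum(math.log(p) for p in target_token_probs)


model = ToyCCoE(backbone=[scale_layer(2.0)])
model.push("math", [scale_layer(3.0)])
model.push("code", [scale_layer(5.0)])
outputs = model.forward([1.0, 2.0], router=lambda q: "math")
```

Because the backbone pass is shared and only the routed expert runs, adding an expert in this sketch costs only its own layers, which is the intuition behind the memory savings claimed for the framework.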

Abstract

Large Language Models (LLMs) have achieved exceptional performance across diverse domains through training on massive datasets. However, scaling LLMs to support multiple downstream domain applications remains a significant challenge, especially under resource constraints. Existing approaches often struggle to balance performance across multiple domains with resource efficiency, limiting their broader applicability. To address this, we introduce the CCoE architecture, a modular framework that seamlessly integrates domain-specific experts into a unified LLM. By leveraging independently trained expert subnetworks on a shared backbone partition, CCoE achieves state-of-the-art performance while significantly reducing the resource requirements for multi-expert deployments. Furthermore, rule-based gating and expert planning in CCoE enable flexible task allocation, promoting expert collaboration to handle complex reasoning tasks. CCoE not only reduces inference costs but also provides a flexible and scalable solution for integrating domain expertise across diverse applications. Experiments on five domains demonstrate that CCoE achieves comparable performance to current domain-specific LLMs. Moreover, compared to existing multi-domain model ensemble methods, CCoE reduces memory usage by 61.3%, while improving inference efficiency by 0.76x over parameter-efficient multi-expert integration approaches.

Paper Structure

This paper contains 17 sections, 5 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Examples from two typical application scenarios to illustrate the proposed CCoE framework. $Q_a$ represents a case where routing is based on a pre-identified expert (black arrow), and $O_{a,1}$ denotes the response after consulting the medical expert. $Q_b$ is a query that requires expert planning (yellow arrow), and $O^{(t)}_{b,*}$ represents the final answer obtained after collaboration among multiple experts. Note that the layers of each expert network can be flexibly distributed within the shared LLM.
  • Figure 2: The distribution of data across five domains in our experimental corpus: Math, Code, Law, Medicine, and Text-to-SQL.
  • Figure 3: Comparison of GPU-MU and TPS between our CCoE framework and the MDME and SLoRA approaches. Our CCoE is based on a 7B backbone LLM, while MDME consists of five domain-specific models: MetaMath-7B [yu2024metamath], Fuzi-Mingcha-6B [deng-etal-2023-syllogistic], Code LLaMA-Python-7B [rozière2024codellamaopenfoundation], Meditron-7B [chen2023meditron], and RESDSQL-3B [li2023resdsql]. SLoRA uses the same backbone LLM as ours and performs inference with the support of Punica LoRA in vLLM.
  • Figure 4: The evaluation of the expert layer insertion strategies in our CCoE framework.
  • Figure 5: The evaluation of expert planning within the CCoE framework for complex tasks.