Bench-CoE: a Framework for Collaboration of Experts from Benchmark

Yuanshuai Wang, Xingjian Zhang, Jinkun Zhao, Siwei Wen, Peilin Feng, Shuhao Liao, Lei Huang, Wenjun Wu

TL;DR

Bench-CoE introduces a benchmark-driven framework to enable Collaboration of Experts by routing tasks among multiple LLM/LMM experts. It proposes two routing paradigms—query-level and subject-level—trained from benchmark evaluations to select the most capable expert per input or per subject, respectively. Across language and multimodal benchmarks under naive, in-distribution, and out-of-distribution scenarios, Bench-CoE consistently outperforms single models and, in some cases, larger LLM baselines, while incurring minimal additional training or labeling costs. This framework provides a scalable, interpretable baseline for integrating diverse expert models and guiding future routing strategies in multi-task AI systems.

Abstract

Large Language Models (LLMs) are key technologies driving intelligent systems to handle multiple tasks. To meet the demands of various tasks, an increasing number of LLM-driven experts with diverse capabilities have been developed, accompanied by corresponding benchmarks to evaluate their performance. This paper proposes the Bench-CoE framework, which enables Collaboration of Experts (CoE) by effectively leveraging benchmark evaluations to achieve optimal performance across various tasks. Bench-CoE includes a set of expert models, a router for assigning tasks to corresponding experts, and a benchmark dataset for training the router. Moreover, we formulate Query-Level and Subject-Level approaches based on our framework, and analyze the merits and drawbacks of these two approaches. Finally, we conduct a series of experiments with varying data distributions on both language and multimodal tasks to validate that our proposed Bench-CoE outperforms any single model in terms of overall performance. We hope this method serves as a baseline for further research in this area. The code is available at https://github.com/ZhangXJ199/Bench-CoE.
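
As a rough illustration of how benchmark evaluations could be turned into router training labels at the two granularities the abstract names, the sketch below builds subject-level and query-level labels from per-query correctness records. It is not taken from the Bench-CoE repository: the record layout `(query_id, subject, expert, correct)`, the function names, and the tie-breaking and fallback rules are all assumptions made for this example.

```python
from collections import defaultdict


def subject_level_labels(records):
    """Map each subject to the expert with the highest benchmark accuracy.

    `records` is a list of (query_id, subject, expert, correct) tuples;
    this layout is a hypothetical one chosen for the sketch.
    """
    hits = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(lambda: defaultdict(int))
    for _qid, subject, expert, correct in records:
        hits[subject][expert] += int(correct)
        totals[subject][expert] += 1
    # Experts are sorted by name so that accuracy ties resolve
    # deterministically (an assumption, not a rule from the paper).
    return {
        subject: max(
            sorted(totals[subject]),
            key=lambda e: hits[subject][e] / totals[subject][e],
        )
        for subject in totals
    }


def query_level_labels(records):
    """Map each query to one expert that answered it correctly.

    Assumed rules: prefer the subject-level best expert when it is among
    the correct ones, and fall back to it when no expert was correct.
    """
    best_by_subject = subject_level_labels(records)
    correct_experts = defaultdict(set)
    subject_of = {}
    for qid, subject, expert, correct in records:
        subject_of[qid] = subject
        if correct:
            correct_experts[qid].add(expert)
    labels = {}
    for qid, subject in subject_of.items():
        winners = correct_experts[qid]
        preferred = best_by_subject[subject]
        if not winners or preferred in winners:
            labels[qid] = preferred
        else:
            labels[qid] = min(winners)  # alphabetical tie-break (assumed)
    return labels


if __name__ == "__main__":
    # Toy evaluation log with two hypothetical experts, "A" and "B".
    demo = [
        ("q1", "math", "A", True), ("q1", "math", "B", False),
        ("q2", "law", "A", False), ("q2", "law", "B", True),
    ]
    print(subject_level_labels(demo))
    print(query_level_labels(demo))
```

In this sketch a router would then be trained as a classifier from query text (or query plus image) to the label, with subject-level labels shared by every query in a subject and query-level labels varying per input.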

Paper Structure

This paper contains 35 sections, 15 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: The framework of Bench-CoE. Our Bench-CoE Framework directly trains the router based on the benchmark, utilizing either subject-level or query-level labels for task assignment. This approach enables Bench-CoE to seamlessly integrate multiple expert models without incurring additional training costs, while simultaneously enhancing task performance.
  • Figure 2: Comparison of routing methods in LLMs combination: (a) The MoE model utilizes multiple FFNs as expert modules during inference. (b) The Parallel-Inference-CoE model requires each query to pass through all experts during inference. (c) Although only the best expert is selected for inference, all expert models need to be tested during training to obtain labels. (d) Our Bench-CoE model only uses benchmark evaluation information to generate labels during router training, without extra costs, and uses only the best expert during inference.
  • Figure 3: Performance Across Subjects on MMLU Pro. Bench-CoE (Query-Level) outperforms all other models comprehensively. Bench-CoE (Subject-Level) achieves performance comparable to the top MoE model, Gemma-2-9b-it, and outperforms it in certain subjects.
  • Figure 4: Performance Across Subjects on MMMU. Bench-CoE (Subject-Level) achieves significantly superior performance across almost all subjects.
  • Figure 5: The performance of Llama-3-70B and Bench-CoE on each subject of the MMLU-Pro benchmark. Bench-CoE (Subject-Level) achieves performance comparable to Llama-3-70B, while Bench-CoE (Query-Level) surpasses Llama-3-70B.
  • ...and 1 more figure