Experts are all you need: A Composable Framework for Large Language Model Inference

Shrihari Sridharan, Sourjya Roy, Anand Raghunathan, Kaushik Roy

TL;DR

Comp-LLM introduces a composable LLM inference framework that enables cross-expert collaboration through an explicit sub-query dependency graph. It decomposes a query into sub-queries, routes each one to a domain expert via embedding-based similarity, schedules parallel execution with a runtime scheduler, and aggregates the expert outputs into a final answer. On the MultiExpertQA-P and MultiExpertQA-All benchmarks, Comp-LLM achieves up to 11.01% accuracy gains over similarly sized monolithic LLMs while reducing model size by 1.67x–3.56x and improving latency by 1.1x–1.7x over sequential sub-query processing. Comp-LLM also outperforms Mixture-of-Experts (MoE) baselines, which lack cross-sub-query dependency modeling. The work demonstrates that structured cross-expert collaboration can boost reasoning performance while lowering memory and latency, particularly for static multi-hop tasks.
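
The embedding-based routing step can be illustrated with a minimal sketch. It is not the paper's implementation: the expert names, the three-dimensional toy embeddings, and the choice of cosine similarity as the similarity measure are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def route(sub_query_embedding, expert_embeddings):
    """Return the name of the expert whose embedding is most similar to the sub-query."""
    return max(
        expert_embeddings,
        key=lambda name: cosine(sub_query_embedding, expert_embeddings[name]),
    )

# Hypothetical experts with toy embeddings; a real system would obtain
# these vectors from an embedding model.
experts = {
    "physics": [0.9, 0.1, 0.0],
    "biology": [0.1, 0.9, 0.1],
    "history": [0.0, 0.2, 0.9],
}

chosen = route([0.8, 0.2, 0.1], experts)
print(chosen)
```

In this toy setup the sub-query vector lies closest to the "physics" embedding, so that expert is selected.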

Abstract

Large Language Models (LLMs) have achieved state-of-the-art accuracy on a variety of natural language processing (NLP) tasks. However, this success comes at the cost of increased model size, which adds computational burden. Mixture of Experts (MoE) models overcome this bottleneck by decoupling model capacity from computation, activating only a subset of parameters, or "experts". However, these models require joint pretraining of the experts along with the router and do not model multi-step reasoning. In contrast, multi-agent frameworks improve reasoning by decomposing complex problems into modular subtasks, but they rely on sequential "plan–act–observe" loops, which introduce significant latency. Our work, Comp-LLM, addresses these challenges by introducing a composable inference framework that enables cross-expert collaboration via an explicit sub-query dependency graph. Comp-LLM consists of three components: (1) a Sub-query Generator that decomposes an input query, assigns each sub-query to an appropriate expert using embedding similarity, and constructs a dependency graph; (2) a Query Executor that processes nodes in the graph and identifies opportunities for parallelism based on dependencies and resource constraints; and (3) a Response Aggregator that synthesizes intermediate expert responses into a coherent final answer. Across several benchmarks, Comp-LLM achieves up to 11.01% accuracy improvement over monolithic LLMs of similar size, while offering a 1.67x–3.56x reduction in model size with no significant degradation relative to the largest model in its family. Additionally, Comp-LLM provides a 1.1x–1.7x latency improvement compared to sequential sub-query processing.
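
To make the Query Executor's role concrete, the sketch below shows one way a sub-query dependency graph could be turned into parallel execution waves under a resource limit. It is a simplified illustration under stated assumptions, not the scheduler described in the paper: the function name `schedule_waves`, the `max_parallel` parameter (standing in for the resource constraint), and the wave-by-wave strategy are all hypothetical.

```python
def schedule_waves(dependencies, max_parallel):
    """Group sub-queries into waves that can run in parallel.

    dependencies: dict mapping each sub-query id to the ids it depends on.
    max_parallel: assumed resource limit on sub-queries per wave.
    """
    if max_parallel < 1:
        raise ValueError("max_parallel must be at least 1")
    remaining = {node: set(deps) for node, deps in dependencies.items()}
    done = set()
    waves = []
    while remaining:
        # A sub-query is ready once all of its dependencies have finished.
        ready = sorted(n for n, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("dependency graph has a cycle or an unknown node")
        wave = ready[:max_parallel]
        waves.append(wave)
        for node in wave:
            del remaining[node]
        done.update(wave)
    return waves

# Hypothetical graph: q3 needs the answers to q1 and q2; q4 needs q3.
graph = {"q1": [], "q2": [], "q3": ["q1", "q2"], "q4": ["q3"]}

print(schedule_waves(graph, max_parallel=2))
print(schedule_waves(graph, max_parallel=1))
```

With two parallel slots the independent sub-queries q1 and q2 share a wave, giving three waves instead of the four needed for fully sequential processing. A real executor might instead start each sub-query as soon as its own dependencies finish rather than waiting for a whole wave.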

Paper Structure

This paper contains 21 sections, 7 figures, 15 tables, 1 algorithm.

Figures (7)

  • Figure 1: F1 score vs. model size for different configurations of OPT-Base and OPT-FT (finetuned) on the Expert-QA-P (2 experts) benchmark. The proposed Comp-LLM (3.65B) achieves a higher F1 score than OPT-Base (13B) and OPT-FT (13B).
  • Figure 2: The Comp-LLM framework improves the reasoning capabilities of LLMs. It consists of three key components: Sub-query Generator, Query Executor, and Response Aggregator.
  • Figure 3: Sub-query Generator dataset generation
  • Figure 4: F1 score varying the training dataset size for Sub-query Generator.
  • Figure 5: Expert Router
  • ...and 2 more figures