Distributed Mixture-of-Agents for Edge Inference with Large Language Models

Purbesh Mitra, Priyanka Kaswan, Sennur Ulukus

TL;DR

This work addresses edge-device inference by distributing a Mixture-of-Agents (MoA) across a network of LLM-enabled devices that gossip prompts and responses without a central server. The authors formalize a multi-layer proposer-aggregator MoA and derive a queuing-stability condition: device queues stay bounded when $((k+1)M+1)\lambda < 1/\alpha_{\max}$, where $M$ is the number of layers, $k$ is the number of proposers per layer, and $\alpha_{\max}$ is the largest of the heterogeneous inference times $\alpha$. They validate the framework experimentally with open-source LLMs on AlpacaEval 2.0, showing that increasing $M$ and $k$ can improve accuracy but also increases latency and average queue size, and that using diverse LLMs further boosts performance. The results support deploying distributed MoA on edge networks to obtain higher-quality responses while reducing dependence on a centralized server, and they guide parameter choices for practical edge deployments in privacy- and latency-conscious environments.

Abstract

Mixture-of-Agents (MoA) has recently been proposed as a method to enhance the performance of large language models (LLMs), enabling multiple individual LLMs to work together for collaborative inference. This collaborative approach results in improved responses to user prompts compared to relying on a single LLM. In this paper, we consider such an MoA architecture in a distributed setting, where LLMs operate on individual edge devices, each uniquely associated with a user and equipped with its own computing power. These devices exchange information using decentralized gossip algorithms, allowing different device nodes to communicate without the supervision of a centralized server. In the considered setup, different users have their own LLMs to address user prompts. Additionally, the devices gossip either their own user-specific prompts or augmented prompts to generate more refined answers to certain queries. User prompts are temporarily stored in the device queues when their corresponding LLMs are busy. Given the memory limitations of edge devices, it is crucial to ensure that the average queue sizes in the system remain bounded. In this paper, we address this by theoretically calculating the queuing stability conditions for the device queues under reasonable assumptions, which we also validate experimentally. Further, we demonstrate through experiments, leveraging open-source LLMs for the implementation of distributed MoA, that certain MoA configurations produce higher-quality responses than others, as evaluated on the AlpacaEval 2.0 benchmark. The implementation is available at: https://github.com/purbeshmitra/distributed_moa.
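
The queue behavior described in the abstract can be illustrated with a minimal single-queue simulation. This is only a sketch under assumptions that this summary does not confirm: prompts are assumed to arrive as a Poisson process, the LLM is assumed to take a fixed time per prompt, and the function name `mean_waiting_time` is hypothetical. It uses the standard Lindley recursion for a first-come-first-served queue and is not the authors' implementation.

```python
import random


def mean_waiting_time(arrival_rate: float, service_time: float,
                      num_prompts: int, seed: int = 0) -> float:
    """Average time a prompt waits in the queue before its inference starts.

    Assumptions (not taken from the paper): Poisson arrivals with rate
    `arrival_rate`, a constant inference time `service_time`, and one LLM
    serving prompts in arrival order.
    """
    rng = random.Random(seed)
    wait = 0.0   # waiting time of the current prompt
    total = 0.0  # running sum of waiting times
    for _ in range(num_prompts):
        total += wait
        # Time until the next prompt arrives (exponential inter-arrival gap).
        gap = rng.expovariate(arrival_rate)
        # Lindley recursion: leftover work seen by the next prompt.
        wait = max(0.0, wait + service_time - gap)
    return total / num_prompts


if __name__ == "__main__":
    # Load = arrival_rate * service_time; below 1 the queue stays bounded.
    print("load 0.5:", mean_waiting_time(0.5, 1.0, 5000))
    print("load 1.5:", mean_waiting_time(1.5, 1.0, 5000))
```

When the load (arrival rate times inference time) is below 1, the average wait settles to a small value; above 1 it keeps growing with the number of prompts, which is the unbounded-queue regime that an edge device with limited memory has to avoid.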

Paper Structure

This paper contains 5 sections, 1 theorem, 6 equations, 4 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

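The individual queues in the distributed MoA setting remain stable if the following condition, given here in the form quoted in the TL;DR for heterogeneous inference times $\alpha$, is satisfied:

$((k+1)M+1)\lambda < 1/\alpha_{\max}$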
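
As a rough aid for choosing parameters, the stability condition quoted in the TL;DR, $((k+1)M+1)\lambda < 1/\alpha_{\max}$, can be checked numerically. The sketch below is an illustration only: the function names are hypothetical, and $\lambda$ is assumed to be a per-device prompt rate expressed in the reciprocal of the time unit used for $\alpha_{\max}$.

```python
def effective_load(k: int, M: int, lam: float, alpha_max: float) -> float:
    """Left-hand side of the condition scaled by alpha_max.

    The condition ((k+1)M+1) * lam < 1 / alpha_max holds exactly when
    this value is below 1.
    """
    return ((k + 1) * M + 1) * lam * alpha_max


def is_stable(k: int, M: int, lam: float, alpha_max: float) -> bool:
    """True if the quoted stability condition is satisfied."""
    return effective_load(k, M, lam, alpha_max) < 1.0


def max_stable_rate(k: int, M: int, alpha_max: float) -> float:
    """Supremum of lam for which the condition holds (the bound is strict)."""
    return 1.0 / (((k + 1) * M + 1) * alpha_max)


if __name__ == "__main__":
    # Example values chosen for illustration: 3 proposers, 2 layers,
    # slowest inference time of 10 time units.
    print(is_stable(k=3, M=2, lam=0.01, alpha_max=10.0))
    print(is_stable(k=3, M=2, lam=0.02, alpha_max=10.0))
    print(max_stable_rate(k=3, M=2, alpha_max=10.0))
```

With $k = 3$ and $M = 2$ the multiplier is $(3+1)\cdot 2 + 1 = 9$, so adding layers or proposers shrinks the admissible prompt rate in proportion.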

Figures (4)

  • Figure 1: An example of decision-making with edge devices. The central server, which has large compute power, is directly connected to the edge devices. However, it is spatially separated from them, incurring large communication delays. It is also a single point of failure that can go offline due to link failures or adversarial attacks. Collaboration among edge devices, by contrast, gives each device more diverse connections to more compute power than it has on its own, making the system more robust.
  • Figure 2: System model of distributed MoA. Each device has its own local prompts, which it infers with its local LLM. Simultaneously, these prompts are sent to a few neighboring LLMs for inference, and their responses are then aggregated by the original LLM. This response generation and aggregation process can continue for multiple rounds, constituting multiple layers of the MoA.
  • Figure 3: An illustration of a 2-layer mixture-of-agents system. The original prompt is inferred by 3 proposer LLMs in each layer, and the result is finally aggregated at a single aggregator LLM. From the second layer onwards, the LLMs use the system prompt to generate refined outputs from the concatenated prompts.
  • Figure 4: An overview of user $i$ in distributed MoA. The device hosts LLM$_i$, which sequentially infers prompts from the adjacent queue. The prompts are generated by user $i$ itself and by its neighbor $j$. The output of the LLM is sent to the neighboring node for further assistance in inference. The final output comes from aggregating the different responses at LLM$_i$.
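
The layered proposer-aggregator flow described for Figure 3 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the LLMs are stand-in callables, the names `run_moa`, `build_augmented_prompt`, and `make_stub` are hypothetical, and the instruction text is a placeholder because the paper's actual system prompt is not given in this summary.

```python
from typing import Callable, List

LLM = Callable[[str], str]

# Placeholder instruction (an assumption, not the paper's system prompt).
AGGREGATE_INSTRUCTION = (
    "Synthesize the candidate responses below into one refined answer."
)


def build_augmented_prompt(prompt: str, responses: List[str]) -> str:
    """Concatenate the original prompt with the previous layer's responses."""
    numbered = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(responses))
    return (
        f"{AGGREGATE_INSTRUCTION}\n\n"
        f"Original prompt: {prompt}\n\n"
        f"Candidate responses:\n{numbered}"
    )


def run_moa(prompt: str, proposers: List[LLM], aggregator: LLM,
            num_layers: int) -> str:
    """Run `num_layers` proposer layers, then a single final aggregation."""
    responses: List[str] = []
    for layer in range(num_layers):
        # The first layer sees the raw prompt; later layers see the prompt
        # augmented with the previous layer's responses.
        layer_input = (
            prompt if layer == 0
            else build_augmented_prompt(prompt, responses)
        )
        responses = [llm(layer_input) for llm in proposers]
    return aggregator(build_augmented_prompt(prompt, responses))


def make_stub(name: str, calls: List[str]) -> LLM:
    """Stand-in LLM that records each call and returns a tagged string."""
    def stub(text: str) -> str:
        calls.append(name)
        return f"[{name}] answer"
    return stub


if __name__ == "__main__":
    calls: List[str] = []
    proposers = [make_stub(f"p{i}", calls) for i in range(3)]
    aggregator = make_stub("agg", calls)
    print(run_moa("What is gossip?", proposers, aggregator, num_layers=2))
    print(len(calls), "inference calls")
```

In a distributed deployment each proposer call would be a prompt gossiped to a neighboring device and placed in that device's queue, as in Figure 4; here every call runs locally and in sequence to keep the sketch self-contained.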

Theorems & Definitions (2)

  • Theorem 1
  • Remark 1