Model-Distributed Inference for Large Language Models at the Edge

Davide Macario, Hulya Seferoglu, Erdem Koyuncu

TL;DR

This work tackles the challenge of running large language models on edge devices by distributing the model across multiple low-power nodes. It introduces recurrent pipeline parallelism to overlap computation across samples and reduce idle time, and augments the framework with KV caching and Grouped Query Attention to maintain high throughput in a distributed, autoregressive setting. The proposed MDI-LLM framework defines a ring network of starter and secondary nodes, with a tailored model-partitioning strategy that balances compute and minimizes inter-node communication. Empirical results on edge hardware demonstrate throughput gains and per-device memory reductions as more devices participate, enabling LLM inference on hardware that would not support a monolithic deployment. This work lays a foundation for scalable, edge-based deployment of transformer-based models and suggests directions for further optimization in multi-device, resource-constrained environments.
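To make the memory argument concrete, here is a minimal sketch of how KV cache size scales with Grouped Query Attention and with the number of layers a node hosts. It uses the standard estimate of two cached tensors (keys and values) per layer. The function name `kv_cache_bytes` and every model dimension below are hypothetical values chosen for illustration; none of them is taken from the paper.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_elem: int = 2) -> int:
    """Estimate KV cache size: one K and one V tensor per layer.

    Each tensor holds batch * n_kv_heads * seq_len * head_dim elements.
    bytes_per_elem=2 assumes 16-bit storage (an assumption, not a paper setting).
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


MIB = 1024 * 1024

# Hypothetical model: 24 layers, 16 query heads, head_dim 64, 1024-token context.
full_mha = kv_cache_bytes(n_layers=24, n_kv_heads=16, head_dim=64, seq_len=1024)
# Same model with GQA: 4 KV heads shared among the 16 query heads.
full_gqa = kv_cache_bytes(n_layers=24, n_kv_heads=4, head_dim=64, seq_len=1024)
# One of 3 nodes, each hosting 8 of the 24 layers, caches only its own layers.
per_node = kv_cache_bytes(n_layers=8, n_kv_heads=4, head_dim=64, seq_len=1024)

print(full_mha // MIB, full_gqa // MIB, per_node // MIB)
```

Under these assumptions, the cache shrinks with the ratio of KV heads to query heads, and each node stores only the share belonging to its own layers. That is consistent with the per-device memory reduction described above.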

Abstract

We introduce Model-Distributed Inference for Large Language Models (MDI-LLM), a novel framework designed to facilitate the deployment of state-of-the-art large language models (LLMs) across low-power devices at the edge. This is accomplished by dividing the model into multiple partitions, which are then assigned to different devices/nodes within the network. These nodes exchange intermediate activation vectors via device-to-device links, enabling collaborative computation. To enhance the efficiency of this process, we propose the "recurrent pipeline parallelism" technique, which reduces idle time on each device and facilitates parallel inference during the generation of multiple text sequences. By leveraging the combined computational resources of multiple edge devices, MDI-LLM enables the deployment of LLMs that exceed the memory capacity of individual devices, making it possible to perform inference on low-cost hardware. Furthermore, as the number of participating devices increases, MDI-LLM boosts token generation throughput and reduces memory consumption per device.
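The partitioning and pipelining ideas in the abstract can be illustrated with a small simulation. The sketch below is not the paper's algorithm. It assumes an even, contiguous split of layers (the paper's partitioning strategy may differ), a ring in which every node takes exactly one time slot per sample, and zero communication cost. The names `split_layers`, `schedule`, and `utilization` are hypothetical.

```python
from typing import List, Optional


def split_layers(n_layers: int, n_nodes: int) -> List[range]:
    """Split layer indices into contiguous, near-equal chunks, one per node."""
    base, extra = divmod(n_layers, n_nodes)
    chunks, start = [], 0
    for node in range(n_nodes):
        size = base + (1 if node < extra else 0)  # first `extra` nodes get one more
        chunks.append(range(start, start + size))
        start += size
    return chunks


def schedule(n_nodes: int, n_samples: int, n_slots: int) -> List[List[Optional[int]]]:
    """Return table[t][k]: the sample node k works on in slot t, or None if idle.

    Sample s enters node 0 in slot s, moves one node per slot, and re-enters
    node 0 for its next token once a full period has elapsed.
    """
    period = max(n_nodes, n_samples)
    table = []
    for t in range(n_slots):
        row: List[Optional[int]] = []
        for k in range(n_nodes):
            offset = t - k  # slots elapsed since this node first became reachable
            sample = offset % period if offset >= 0 else None
            row.append(sample if sample is not None and sample < n_samples else None)
        table.append(row)
    return table


def utilization(table: List[List[Optional[int]]]) -> float:
    """Fraction of (slot, node) cells in which a node is busy."""
    busy = sum(cell is not None for row in table for cell in row)
    return busy / (len(table) * len(table[0]))


if __name__ == "__main__":
    print(split_layers(10, 3))
    for n_samples in (1, 3):
        table = schedule(n_nodes=3, n_samples=n_samples, n_slots=6)
        print(n_samples, "sample(s):", table, round(utilization(table), 2))
```

In this toy model, a single sequence keeps only one of the three nodes busy at any time. With three sequences in flight, every node is busy once the pipeline has filled, which is the idle-time reduction that generating multiple text sequences in parallel is meant to provide.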

Paper Structure

This paper contains 13 sections, 4 figures, 2 tables, and 2 algorithms.

Figures (4)

  • Figure 1: Model partitioning scheme of MDI-LLM.
  • Figure 2: Recurrent pipeline parallelism for MDI-LLM.
  • Figure 3: Time vs. generated tokens -- comparison on a 300M-parameter model for 3 generated samples of 800 tokens each.
  • Figure 4: Behavior at the origin -- note the initial slowdown caused by KV cache and internal state initialization.