An Explorative Study on Distributed Computing Techniques in Training and Inference of Large Language Models
Sheikh Azizul Hakim, Saem Hasan
TL;DR
The paper examines distributed computing approaches that enable large language model (LLM) training and inference at scale. It first surveys PETALS, a resource-pooling system that democratizes LLM usage on consumer hardware, and introduces an NSGA-II–based metaheuristic to optimize the chain of servers for latency and throughput. It then conducts a comparative study of three state-of-the-art LLM serving frameworks (ORCA, vLLM, InfiniteLLM), detailing architectural and scheduling strategies such as iteration-based scheduling, PagedAttention, and distributed KV-cache management. The findings highlight substantial throughput gains from distributed serving techniques and memory-management innovations, while also noting practical limitations (e.g., testing constraints on public swarms) and the need for further work on efficient decoding and cross-node resource coordination. Overall, the work demonstrates viable pathways for scalable LLM usage on commodity hardware and outlines architectural directions for future distributed serving systems.
Abstract
Large language models (LLMs) are advanced AI systems trained on extensive textual data, leveraging deep learning techniques to understand and generate human-like language. Today's LLMs, with billions of parameters, are so large that hardly any single computing node can train, fine-tune, or run inference on them. Several distributed computing techniques have therefore been introduced in the literature to make effective use of LLMs. We explore the application of distributed computing techniques to LLMs from two angles.
\begin{itemize}
\item We study the techniques that democratize LLMs, that is, how large models can be run on consumer-grade computers. Here, we also implement a novel metaheuristic-based modification to an existing system.
\item We perform a comparative study of three state-of-the-art LLM serving techniques.
\end{itemize}
