Combining Cloud and Mobile Computing for Machine Learning
Ruiqi Xu, Tianchi Zhang
TL;DR
This work investigates edge-cloud collaborative inference through model segmentation to reduce cloud workload while meeting latency SLAs. It introduces a latency-aware scheduler that considers device capability, network quality, and job requirements, and validates the approach on RegNet and Stable Diffusion. Results show that Stable Diffusion can benefit from offloading with intelligent batching and preloading, whereas RegNet may not because of data transfer costs; the scheduler achieves substantial reductions in cloud GPU usage. The study outlines a practical path toward fog-like computing, highlights memory and security considerations, and proposes future refinements for memory management and adaptive SLA policies.
Abstract
Although the computing power of mobile devices is increasing, machine learning models are also growing in size. This trend creates problems for mobile devices because of limitations such as memory capacity and battery life. While many services, like ChatGPT and Midjourney, run all inference in the cloud, we believe a flexible and fine-grained task distribution is more desirable. In this work, we consider model segmentation as a way to improve the user experience: the computation is divided between mobile devices and the cloud so that the compute-heavy portion of the model is offloaded while the data transfer required is minimized. We show that this division not only reduces the wait time for users but can also be tuned to optimize the workload of the cloud. To achieve this, we design a scheduler that collects information about network quality, client device capability, and job requirements, and makes decisions that achieve consistent performance across a range of devices while reducing the work the cloud needs to perform.
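To make the scheduling idea concrete, the following is a minimal sketch of how a split point might be chosen. It is not the scheduler described in this work: the `Layer` type, the functions `estimate_latency` and `choose_split`, the linear cost model, and all parameter names are assumptions introduced for illustration. The sketch assumes the device runs the first `split` layers and the cloud runs the rest, and it picks the split that leaves the least work for the cloud while keeping the estimated latency within the SLA.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Layer:
    flops: float      # estimated compute cost of the layer
    out_bytes: float  # size of the activation the layer emits


def estimate_latency(layers: List[Layer], split: int, input_bytes: float,
                     device_flops_per_s: float, cloud_flops_per_s: float,
                     bandwidth_bytes_per_s: float, rtt_s: float) -> float:
    """Estimated end-to-end latency when the device runs layers[:split]
    and the cloud runs layers[split:]. Hypothetical linear cost model;
    the cost of returning the final result is ignored."""
    device_time = sum(l.flops for l in layers[:split]) / device_flops_per_s
    cloud_time = sum(l.flops for l in layers[split:]) / cloud_flops_per_s
    if split == len(layers):
        transfer_time = 0.0  # fully on-device, nothing is uploaded
    else:
        # Upload the raw input (split == 0) or the activation at the cut.
        payload = input_bytes if split == 0 else layers[split - 1].out_bytes
        transfer_time = payload / bandwidth_bytes_per_s + rtt_s
    return device_time + transfer_time + cloud_time


def choose_split(layers: List[Layer], input_bytes: float,
                 device_flops_per_s: float, cloud_flops_per_s: float,
                 bandwidth_bytes_per_s: float, rtt_s: float,
                 sla_s: float) -> int:
    """Return the largest split (least cloud work) whose estimated latency
    meets the SLA; if none does, return the lowest-latency split."""
    candidates = range(len(layers), -1, -1)  # most on-device work first

    def latency(split: int) -> float:
        return estimate_latency(layers, split, input_bytes,
                                device_flops_per_s, cloud_flops_per_s,
                                bandwidth_bytes_per_s, rtt_s)

    for split in candidates:
        if latency(split) <= sla_s:
            return split
    return min(candidates, key=latency)


# Illustrative numbers only; they are not measurements from this work.
example_layers = [Layer(1e9, 1e6), Layer(8e9, 1e4), Layer(1e9, 1e3)]
example_split = choose_split(example_layers, input_bytes=5e5,
                             device_flops_per_s=2e9, cloud_flops_per_s=1e11,
                             bandwidth_bytes_per_s=1e6, rtt_s=0.05, sla_s=2.0)
```

Under this cost model, a faster device or a looser SLA pushes the split later in the model, so the cloud does less work, while a slow device or a tight SLA pushes the split toward full offloading.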
