Many Hands Make Light Work: Accelerating Edge Inference via Multi-Client Collaborative Caching
Wenyi Liang, Jianchun Liu, Hongli Xu, Chunming Qiao, Liusheng Huang
TL;DR
CoCa accelerates edge inference under non-IID and long-tail data distributions through a multi-client collaborative caching framework. It combines a two-dimensional global cache with adaptive client-side cache allocation and periodic server updates, balancing lookup overhead and accuracy through a heuristic ACA algorithm. The key contributions are the global cache update mechanism, dynamic cache-layer activation, and an adaptive hot-spot class selection strategy, which jointly reduce latency (by 23.0% to 45.2% in experiments) with only a slight loss of accuracy. This approach enables scalable, privacy-preserving edge inference that remains robust across diverse data distributions and model architectures, offering practical benefits for latency-sensitive applications such as autonomous driving and smart surveillance.
Abstract
Edge inference enables real-time data processing and analysis on clients near the data source. To comply with Service-Level Objectives (SLOs), such as a 30% latency reduction target, caching is usually adopted to reduce redundant computation in inference tasks on stream data. Because tasks and data are correlated across clients, sharing cache information among clients can improve inference performance. However, two properties of the data reduce cache hit ratios and increase latency: the data are non-independent and identically distributed (non-IID) across clients, and they follow long-tail distributions, in which some classes have significantly more samples than others. To address these challenges, we propose CoCa, an efficient inference framework that leverages a multi-client collaborative caching mechanism to accelerate edge inference. On the client side, the model is pre-set with multiple cache layers to enable fast inference. During inference, the model performs sequential lookups at the cache layers activated by the edge server. On the server side, CoCa uses a two-dimensional global cache to periodically aggregate information from clients, mitigating the effects of non-IID data. For client cache allocation, CoCa first evaluates the importance of each class based on how frequently and how recently its samples have been accessed. CoCa then selects frequently recurring classes to address the long-tail distribution challenge. Finally, CoCa dynamically activates cache layers to balance lookup overhead and accuracy. Extensive experiments demonstrate that CoCa reduces inference latency by 23.0% to 45.2% on the VGG, ResNet and AST models with only a slight loss of accuracy.
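To make the class-importance step concrete, the following is a minimal sketch of one way to score classes by how frequently and how recently they are accessed, and then to pick the top-scoring ones as hot-spot classes. The abstract does not give CoCa's actual scoring formula, so the exponentially decayed access counter used here is an assumption, and the names `ClassImportanceTracker`, `half_life`, `record`, `importance` and `hot_classes` are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


class ClassImportanceTracker:
    """Hypothetical helper: per-class score combining frequency and recency.

    Assumption: each access adds 1 to the class score, and the score decays
    exponentially with elapsed time. This is not CoCa's published formula.
    """

    def __init__(self, half_life: float = 100.0):
        # After `half_life` time steps, an old access counts half as much.
        self.decay = math.log(2.0) / half_life
        self.score = defaultdict(float)
        self.last_seen = {}

    def record(self, label: int, t: float) -> None:
        """Register one access to class `label` at time step `t`."""
        prev_t = self.last_seen.get(label, t)
        # Decay the stored score by the elapsed time, then count this access.
        decayed = self.score[label] * math.exp(-self.decay * (t - prev_t))
        self.score[label] = decayed + 1.0
        self.last_seen[label] = t

    def importance(self, label: int, now: float) -> float:
        """Return the decayed score of `label` as seen at time `now`."""
        if label not in self.last_seen:
            return 0.0
        elapsed = now - self.last_seen[label]
        return self.score[label] * math.exp(-self.decay * elapsed)

    def hot_classes(self, now: float, k: int) -> list:
        """Return the `k` classes with the highest importance at `now`."""
        ranked = sorted(
            self.last_seen,
            key=lambda c: self.importance(c, now),
            reverse=True,
        )
        return ranked[:k]


# Usage: class 2 is both frequent and recent, class 0 is frequent but older,
# and class 1 was seen only once.
tracker = ClassImportanceTracker(half_life=10.0)
stream = [0, 0, 0, 1, 2, 2, 2, 2]
for t, label in enumerate(stream):
    tracker.record(label, t)
hot = tracker.hot_classes(now=len(stream), k=2)
```

Under this assumed scoring, a class seen many times long ago can rank below a class seen fewer times more recently, which is the behavior a cache serving drifting stream data would want. A client-side cache could then hold entries only for the classes returned by `hot_classes`, leaving rare tail classes to full model inference.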
