Many Hands Make Light Work: Accelerating Edge Inference via Multi-Client Collaborative Caching

Wenyi Liang, Jianchun Liu, Hongli Xu, Chunming Qiao, Liusheng Huang

TL;DR

CoCa tackles the challenge of accelerating edge inference under non-IID and long-tail data distributions by introducing a multi-client collaborative caching framework. It combines a two-dimensional global cache with adaptive client-side cache allocation and periodic server updates, balancing lookup overhead and accuracy through a heuristic ACA algorithm. The key contributions are the global cache update mechanism, dynamic cache-layer activation, and an adaptive hot-spot class selection strategy that jointly reduce latency (up to ~45% in experiments) with minimal accuracy loss. This approach enables scalable, privacy-preserving edge inference with robust performance under diverse data distributions and model architectures, offering practical benefits for latency-sensitive applications like autonomous driving and smart surveillance.

Abstract

Edge inference enables real-time data processing and analysis on clients near the data source. To comply with Service-Level Objectives (SLOs), such as a 30% latency reduction target, caching is usually adopted to reduce redundant computation in inference tasks on stream data. Because tasks and data are correlated across clients, sharing cache information among clients can improve inference performance. However, the non-independent and identically distributed (non-IID) nature of data across clients, together with long-tail distributions in which some classes have significantly more samples than others, reduces cache hit ratios and increases latency. To address these challenges, we propose an efficient inference framework, CoCa, which leverages a multi-client collaborative caching mechanism to accelerate edge inference. On the client side, the model is pre-set with multiple cache layers to enable fast inference. During inference, the model performs sequential lookups at the cache layers activated by the edge server. On the server side, CoCa uses a two-dimensional global cache to periodically aggregate information from clients, mitigating the effects of non-IID data. For client cache allocation, CoCa first evaluates the importance of classes based on how frequently and how recently their samples have been accessed. CoCa then selects frequently recurring classes to address long-tail distribution challenges. Finally, CoCa dynamically activates cache layers to balance lookup overhead and accuracy. Extensive experiments demonstrate that CoCa reduces inference latency by 23.0% to 45.2% on the VGG, ResNet and AST models with a slight loss of accuracy.
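
The client-side flow described in the abstract (sequential lookups at activated cache layers, plus class selection by access frequency and recency) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: the functions `cached_inference` and `hot_classes`, the use of cosine similarity against one cached vector per class, the hit threshold `theta`, and the exponentially decayed access count are all assumptions chosen for illustration. The paper's actual hit criterion and scoring rule are not given in this section.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0.0 and nb > 0.0 else 0.0


def lookup(feature: Vector, entries: Dict[str, Vector], theta: float) -> Optional[str]:
    """Return the best-matching cached class if its similarity reaches theta."""
    if not entries:
        return None  # this cache layer is not activated
    label, sim = max(
        ((c, cosine(feature, v)) for c, v in entries.items()),
        key=lambda pair: pair[1],
    )
    return label if sim >= theta else None


def cached_inference(
    x: Vector,
    layers: List[Callable[[Vector], Vector]],
    cache: Dict[int, Dict[str, Vector]],
    theta: float,
    classifier: Callable[[Vector], str],
) -> Tuple[str, Optional[int]]:
    """Run layers in order; exit early at the first cache hit (assumed policy)."""
    h = x
    for idx, layer in enumerate(layers):
        h = layer(h)
        hit = lookup(h, cache.get(idx, {}), theta)
        if hit is not None:
            return hit, idx  # remaining layers are skipped
    return classifier(h), None  # no hit: fall back to the full model


def hot_classes(
    access_log: List[Tuple[str, int]], now: int, k: int, decay: float = 0.9
) -> List[str]:
    """Rank classes by a decayed access count (assumed frequency/recency score)."""
    scores: Dict[str, float] = {}
    for label, t in access_log:
        scores[label] = scores.get(label, 0.0) + decay ** (now - t)
    return sorted(scores, key=lambda c: scores[c], reverse=True)[:k]
```

In this sketch, `cache` maps a layer index to the class entries activated at that layer, so a layer with no entry is simply skipped, and the output of `hot_classes` would decide which classes a client keeps cached. Raising `theta` trades fewer early exits for fewer wrong cache hits, which is the lookup-overhead versus accuracy balance the abstract refers to.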

Paper Structure

This paper contains 34 sections, 11 equations, 10 figures, 3 tables, 1 algorithm.

Figures (10)

  • Figure 1: Test results for ResNet101 on a subset of 50 classes from UCF101. The left plot shows latency and accuracy for different cache sizes. The right plot shows hit ratio and accuracy for different cache layers.
  • Figure 2: t-SNE visualization of cosine-similarity clustering for the semantic vectors of samples from 4 classes of the UCF101 dataset, cached in the 18th of the 34 cache layers in ResNet101. The larger points represent the cached semantic centers for each class. "Previous" and "After" show the cached entries before and after applying the global update strategy, respectively.
  • Figure 3: Overview of CoCa.
  • Figure 4: An illustrative example of CoCa at round $T$. The server maintains a 5×5 global cache. Rows correspond to different classes, while columns correspond to different cache layers. Dashed outlines indicate empty cache layers, and dark blocks represent semantic cache entries.
  • Figure 5: The impact of different threshold values $\Theta$.
  • ...and 5 more figures