Communication-Computation Trade-Off in Resource-Constrained Edge Inference

Jiawei Shao, Jun Zhang

TL;DR

This article presents effective methods for edge inference on resource-constrained devices. It focuses on device-edge co-inference assisted by an edge computing server and investigates the critical trade-off between the computational cost of the on-device model and the communication overhead of forwarding the intermediate feature to the edge server.

Abstract

The recent breakthrough in artificial intelligence (AI), especially deep neural networks (DNNs), has affected every branch of science and technology. In particular, edge AI has been envisioned as a major application scenario for providing DNN-based services at edge devices. This article presents effective methods for edge inference on resource-constrained devices. It focuses on device-edge co-inference, assisted by an edge computing server, and investigates the critical trade-off between the computation cost of the on-device model and the communication cost of forwarding the intermediate feature to the edge server. A three-step framework is proposed for effective inference: (1) model split point selection to determine the on-device model, (2) communication-aware model compression to reduce the on-device computation and the resulting communication overhead simultaneously, and (3) task-oriented encoding of the intermediate feature to further reduce the communication overhead. Experiments demonstrate that the proposed framework achieves a better trade-off and significantly lower inference latency than baseline methods.
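To make the trade-off behind split point selection concrete, the following minimal sketch compares candidate split points under a simple additive latency model (on-device compute time, plus transmission time, plus server compute time). The additive model, the `SplitPoint` structure, and every number below are illustrative assumptions, not values or methods taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class SplitPoint:
    """Hypothetical description of one candidate split point."""
    name: str
    device_flops: float   # computation executed on the edge device (FLOPs)
    feature_bytes: float  # data forwarded to the edge server (bytes)
    server_flops: float   # computation executed on the edge server (FLOPs)


def latency(sp, rate_bytes_per_s, device_flops_per_s, server_flops_per_s):
    """Assumed additive model: device compute + transmission + server compute."""
    return (sp.device_flops / device_flops_per_s
            + sp.feature_bytes / rate_bytes_per_s
            + sp.server_flops / server_flops_per_s)


def best_split(candidates, rate_bytes_per_s,
               device_flops_per_s=1e9, server_flops_per_s=1e11):
    """Return the candidate with the lowest modelled end-to-end latency."""
    return min(
        candidates,
        key=lambda sp: latency(sp, rate_bytes_per_s,
                               device_flops_per_s, server_flops_per_s),
    )


# Illustrative numbers only (not from the paper).
CANDIDATES = [
    # Server-based inference: send the raw input, compute everything remotely.
    SplitPoint("server-only", device_flops=0.0,
               feature_bytes=150_000, server_flops=1e9),
    # Device-edge co-inference: compute the front part locally, send the feature.
    SplitPoint("mid-split", device_flops=2e8,
               feature_bytes=20_000, server_flops=8e8),
    # On-device inference: nothing is transmitted (result size ignored).
    SplitPoint("device-only", device_flops=1e9,
               feature_bytes=0.0, server_flops=0.0),
]

for rate in (1e7, 1e5, 1e4):  # communication rate in bytes per second
    choice = best_split(CANDIDATES, rate)
    print(f"rate={rate:.0e} B/s -> {choice.name}")
```

Under these assumed numbers the preferred option shifts from server-based inference to co-inference to on-device inference as the communication rate drops, which is the qualitative behaviour the trade-off describes.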

Paper Structure

This paper contains 24 sections, 5 figures.

Figures (5)

  • Figure 1: The communication-computation plane of edge inference with ResNet for classification tasks. (1) Trade-off: the blue and orange curves show the on-device computation and communication overhead at different split points; the blue curve corresponds to the original network, and the orange curve to the network after model compression and feature encoding. (2) Data amplification: as described in JALAD, data amplification means that the communication overhead of the intermediate feature is larger than that of the input data. The grey dashed line marks the communication overhead of the input data. (3) Special points: the red and blue stars correspond to on-device inference and server-based inference, respectively. The purple star corresponds to one case of device-edge co-inference. With model compression and feature encoding, the on-device computation and communication overhead are reduced, and the data amplification effect is alleviated.
  • Figure 2: The proposed framework for device-edge co-inference. (1) Split the network: the input of the framework is the pre-trained DNN. The first step is to select the split point that divides the DNN into two parts. The front part of the neural network is deployed on the edge device, and the other part is offloaded to the edge server. (2) Compress the on-device model: the on-device model is compressed by incremental network pruning. In each iteration, the mask removes the unimportant weights (sets their values to 0) based on their $l_{2}$-norm. The unmasked weights are then updated by back-propagation. After that, the masked weights are recovered, and the next iteration starts. During training, the sparsity ratio increases continuously until it reaches the desired ratio. (3) Encode the intermediate feature: with the compressed on-device model, a lightweight encoder-decoder pair is used to shrink the volume of the intermediate feature. In addition, using learning-driven source coding or joint source-channel coding, the communication overhead is further reduced by learning the mapping from each symbol to a codeword.
  • Figure 3: The communication-computation trade-off curves in device-edge co-inference.
  • Figure 4: Latency as a function of communication rate. Our method keeps the latency around 0.1 s even when the communication rate becomes poor (less than 40KB).
  • Figure 5: (a) The communication latency under different SNRs in the AWGN channel with channel bandwidth $W = 1\,\textup{MHz}$ and (b) the communication overhead under different bit flipping rates in the binary symmetric channel.
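The incremental pruning loop described in the Figure 2 caption can be sketched as follows. This is a minimal illustration on a flat list of scalar weights, not the authors' implementation: the linear sparsity ramp, the helper names, and the toy quadratic loss are assumptions. For a single scalar weight the $l_{2}$-norm reduces to its absolute value, so the mask here is a plain magnitude mask; the paper may apply the norm to larger weight groups.

```python
def magnitude_mask(weights, sparsity):
    """0/1 mask that zeroes the `sparsity` fraction of smallest-magnitude weights."""
    n_prune = int(sparsity * len(weights))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1.0] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0.0
    return mask


def sparsity_schedule(step, total_steps, target):
    """Assumed schedule: linear ramp from 0 up to the target sparsity ratio."""
    return target * min(1.0, step / total_steps)


def incremental_prune(weights, grad_fn, target_sparsity, steps, lr=0.1):
    """Prune incrementally while training; `grad_fn` maps weights to gradients."""
    dense = list(weights)  # dense copy keeps the values of masked weights
    for step in range(1, steps + 1):
        sparsity = sparsity_schedule(step, steps, target_sparsity)
        mask = magnitude_mask(dense, sparsity)
        # The forward/backward pass sees the masked (sparse) weights.
        masked = [w * m for w, m in zip(dense, mask)]
        grads = grad_fn(masked)
        # Only unmasked weights are updated. Masked weights keep their dense
        # value, so they are "recovered" and may survive the next mask.
        dense = [w - lr * g * m for w, g, m in zip(dense, grads, mask)]
    final_mask = magnitude_mask(dense, target_sparsity)
    return [w * m for w, m in zip(dense, final_mask)]


# Toy example (hypothetical): quadratic loss sum((w - t)^2), gradient 2 * (w - t).
TARGET = [0.0, -1.0, 1.0, 0.0]


def toy_grad(w):
    return [2.0 * (wi - ti) for wi, ti in zip(w, TARGET)]


pruned = incremental_prune([0.3, -0.9, 0.7, 0.2], toy_grad,
                           target_sparsity=0.5, steps=4)
print(pruned)
```

Keeping a dense copy of the weights is what lets a weight that was masked in one iteration return in the next if its magnitude ranks higher under the new mask.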