Resource-aware Deployment of Dynamic DNNs over Multi-tiered Interconnected Systems

Chetna Singhal, Yashuo Wu, Francesco Malandrino, Marco Levorato, Carla Fabiana Chiasserini

TL;DR

This work tackles energy-efficient inference in multi-tier mobile-edge-cloud systems using dynamic DNNs with early exits. It introduces a graph-based framework, FIN, that casts the allocation of DNN blocks across heterogeneous nodes as a minimum-energy path problem on a feasible, replicated-vertex graph, so that the selected path satisfies the latency and accuracy constraints. The allocation problem is proven NP-hard, yet FIN provides a provable approximation via gamma-parameterized latency replication and lambda-proximity pruning, achieving near-optimal energy savings. Empirical results across three branchy DNNs show that FIN reduces inference energy by over 65% compared with a state-of-the-art cost-minimization method, while meeting the required latency and accuracy and scaling to multi-application deployments. The framework advances practical resource-aware ML by jointly optimizing model splitting and deployment under multi-tier constraints, with strong implications for sustainable orchestration of DNN inference across device, edge, and cloud.

Abstract

The increasing pervasiveness of intelligent mobile applications requires exploiting the full range of resources offered by the mobile-edge-cloud network for the execution of inference tasks. However, due to the heterogeneity of such multi-tiered networks, it is essential to make the applications' demand amenable to the available resources while minimizing energy consumption. Modern dynamic deep neural networks (DNNs) achieve this goal through multi-branched architectures in which early exits enable sample-based adaptation of the model depth. In this paper, we tackle the problem of allocating sections of DNNs with early exits to the nodes of the mobile-edge-cloud system. Through a 3-stage graph-modeling approach, we represent the possible options for splitting the DNN and deploying the DNN blocks on the multi-tiered network, embedding both the system constraints and the application requirements in a convenient and efficient way. Our framework -- named Feasible Inference Graph (FIN) -- can identify the solution that minimizes the overall inference energy consumption while enabling distributed inference over the multi-tiered network with the target quality and latency. Our results, obtained for DNNs with different levels of complexity, show that FIN matches the optimum and yields over 65% energy savings relative to a state-of-the-art technique for cost minimization.
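
To make the idea of a minimum-energy path over latency-replicated vertices concrete, the following is a minimal sketch and not the authors' FIN algorithm. It assumes a simplified setting: DNN blocks run in sequence, the input originates at node 0 (the mobile device), and accuracy constraints, early exits, bandwidth limits, and pruning are all ignored. The function name `min_energy_placement`, the `compute` and `transfer` tables, and the toy numbers are hypothetical. Each state is a node replicated across `gamma` latency levels, and latencies are rounded up to the grid so that any placement returned stays within the latency budget.

```python
import math

def min_energy_placement(compute, transfer, latency_budget, gamma):
    """Illustrative only (not the paper's FIN construction).

    compute[b][n]  = (energy, latency) of running block b on node n
    transfer[m][n] = (energy, latency) of moving data from node m to node n
    Returns (energy, placement) or None if nothing fits on the latency grid.
    """
    step = latency_budget / gamma  # width of one latency level

    def levels(latency):
        # Round up to the grid (small tolerance for floating-point noise),
        # so the discretized latency does not underestimate the real one.
        return math.ceil(latency / step - 1e-9)

    # State key: (node hosting the last block, latency level consumed so far).
    # Assumption: the input sample starts at node 0 with zero latency used.
    best = {(0, 0): (0.0, [])}
    for block in compute:
        nxt = {}
        for (m, lv), (energy, path) in best.items():
            for n, (c_energy, c_latency) in enumerate(block):
                t_energy, t_latency = transfer[m][n]
                new_lv = lv + levels(t_latency + c_latency)
                if new_lv > gamma:
                    continue  # this replica would violate the latency budget
                cand = (energy + t_energy + c_energy, path + [n])
                key = (n, new_lv)
                if key not in nxt or cand[0] < nxt[key][0]:
                    nxt[key] = cand  # keep the cheapest way to reach this replica
        best = nxt
    if not best:
        return None
    return min(best.values(), key=lambda item: item[0])

# Hypothetical toy instance: two blocks, node 0 = mobile, node 1 = edge.
compute = [[(5.0, 2.0), (1.0, 1.0)],
           [(6.0, 3.0), (1.0, 1.0)]]
transfer = [[(0.0, 0.0), (2.0, 2.0)],
            [(2.0, 2.0), (0.0, 0.0)]]
print(min_energy_placement(compute, transfer, latency_budget=10.0, gamma=10))
```

In this sketch a larger `gamma` gives a finer latency grid, which reduces the slack introduced by rounding up at the cost of more replicated states.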

Paper Structure

This paper contains 12 sections, 4 equations, 8 figures, 7 tables, 1 algorithm.

Figures (8)

  • Figure 1: Our framework allocates blocks of layers of DNNs with early exits (EEs) to mobile-edge-cloud systems so that energy expenditure is minimized. Multiple applications coexist, and an orchestrator controls the execution of DNN blocks and the information flow across them based on application requirements (accuracy and latency) and system constraints (bandwidth, computing capacity). For Application 1, the orchestrator allocates two blocks with early exits (EE1 and EE2) to a mobile node and an edge server, while for Application 2 the target performance is achieved by using the first two early exits.
  • Figure 2: The graphs used in our solution strategy: two-dimensional, two-plane system model (left); single-plane extended graph (center); feasible graph (right).
  • Figure 3: Solution strategy and steps within FIN.
  • Figure 4: Impact of the configurations listed in Table \ref{t:config}: inference latency (left) and energy consumption (center) for B-AlexNet and B-ResNet; inference accuracy for B-AlexNet-based $h_1$, $h_2$ and B-ResNet-based $h_3$, $h_4$ (right).
  • Figure 5: Total energy consumption of the B-AlexNet configurations obtained through Opt, MCP, and FIN ($\gamma{=}3,10$), as the target inference latency and accuracy vary.
  • ...and 3 more figures