Resource-aware Deployment of Dynamic DNNs over Multi-tiered Interconnected Systems
Chetna Singhal, Yashuo Wu, Francesco Malandrino, Marco Levorato, Carla Fabiana Chiasserini
TL;DR
This work tackles energy-efficient inference for multi-tier mobile-edge-cloud systems using dynamic DNNs with early exits. It introduces a graph-based framework, FIN, that casts the allocation of DNN blocks across heterogeneous nodes as a minimum-energy path problem over feasible paths on a replicated-vertex graph, so that the selected path satisfies the latency and accuracy constraints. The underlying allocation problem is proven NP-hard, yet FIN provides a provable approximation via gamma-parameterized latency replication and lambda-proximity pruning, achieving near-optimal energy consumption. Empirical results across three branchy DNNs show that FIN reduces inference energy by over 65% compared with a state-of-the-art cost-minimization method, while meeting the required latency and accuracy and scaling to multi-application deployments. The framework advances practical resource-aware ML by jointly optimizing model splitting and deployment under multi-tier constraints, with implications for sustainable orchestration of DNN inference across device, edge, and cloud.
Abstract
The increasing pervasiveness of intelligent mobile applications requires exploiting the full range of resources offered by the mobile-edge-cloud network for the execution of inference tasks. However, due to the heterogeneity of such multi-tiered networks, it is essential to make the applications' demand amenable to the available resources while minimizing energy consumption. Modern dynamic deep neural networks (DNNs) achieve this goal through multi-branched architectures in which early exits enable sample-based adaptation of the model depth. In this paper, we tackle the problem of allocating sections of DNNs with early exits to the nodes of the mobile-edge-cloud system. By envisioning a 3-stage graph-modeling approach, we represent the possible options for splitting the DNN and deploying the DNN blocks on the multi-tiered network, embedding both the system constraints and the application requirements in a convenient and efficient way. Our framework, named Feasible Inference Graph (FIN), can identify the solution that minimizes the overall inference energy consumption while enabling distributed inference over the multi-tiered network with the target quality and latency. Our results, obtained for DNNs with different levels of complexity, show that FIN matches the optimum and yields over 65% energy savings relative to a state-of-the-art technique for cost minimization.
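To make the idea of a latency-replicated graph more concrete, the following is a minimal sketch and not the authors' FIN implementation. It assumes a purely sequential DNN (no early exits, no accuracy constraint, no lambda-proximity pruning, a single application) and treats each state (block, node, accumulated latency bucket) as a replicated vertex, where latency is quantized in steps of `gamma` and rounded up so that any returned placement respects the true latency bound. The function name `min_energy_placement`, its parameters, and all numbers in the example are hypothetical.

```python
def min_energy_placement(comp_energy, comp_latency, link_energy, link_latency,
                         max_latency, gamma):
    """Illustrative sketch (hypothetical, not FIN itself).

    comp_energy[b][n], comp_latency[b][n]: cost of running block b on node n.
    link_energy[m][n], link_latency[m][n]: cost of moving data from node m to n.
    Latencies, max_latency and gamma are positive integers in the same unit.
    Returns (total_energy, placement) or None if no feasible placement exists.
    """
    num_blocks = len(comp_energy)
    num_nodes = len(comp_energy[0])
    budget = max_latency // gamma  # number of latency buckets available

    def buckets(latency):
        # Ceiling division: quantized latency never underestimates the real one.
        return -(-latency // gamma)

    # best[(node, used_buckets)] = (energy, placement) for the blocks so far.
    best = {}
    for n in range(num_nodes):
        used = buckets(comp_latency[0][n])
        if used <= budget:
            best[(n, used)] = (comp_energy[0][n], [n])

    for b in range(1, num_blocks):
        nxt = {}
        for (m, used), (energy, placement) in best.items():
            for n in range(num_nodes):
                new_used = used + buckets(link_latency[m][n] + comp_latency[b][n])
                if new_used > budget:
                    continue  # this replicated vertex would violate the latency bound
                new_energy = energy + link_energy[m][n] + comp_energy[b][n]
                key = (n, new_used)
                if key not in nxt or new_energy < nxt[key][0]:
                    nxt[key] = (new_energy, placement + [n])
        best = nxt

    if not best:
        return None
    return min(best.values(), key=lambda v: v[0])


# Hypothetical example: 3 blocks, node 0 = mobile device, node 1 = edge server.
COMP_ENERGY = [[1, 6], [4, 1], [4, 1]]
COMP_LATENCY = [[4, 2], [10, 2], [10, 2]]
LINK_ENERGY = [[0, 2], [2, 0]]
LINK_LATENCY = [[0, 8], [8, 0]]

if __name__ == "__main__":
    print(min_energy_placement(COMP_ENERGY, COMP_LATENCY,
                               LINK_ENERGY, LINK_LATENCY,
                               max_latency=20, gamma=2))
```

In this toy instance, a loose latency bound lets the first block run on the device and the rest on the edge, while a tighter bound forces a more energy-hungry placement or leaves no feasible one. A smaller `gamma` gives a finer latency grid at the cost of more replicated vertices, which is the kind of trade-off the gamma parameter in the summary above refers to.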
