Hybrid-Parallel: Achieving High Performance and Energy Efficient Distributed Inference on Robots

Zekai Sun, Xiuxian Guan, Junming Wang, Haoze Song, Yuhao Qing, Tianxiang Shen, Dong Huang, Fangming Liu, Heming Cui

TL;DR

This paper tackles the latency and energy challenges of deploying deep neural network inference on robotic IoT by addressing bandwidth-constrained wireless links and real-time response requirements. It introduces Hybrid-Parallel, a fine-grained local-operator parallelism approach (LOP) combined with a bandwidth-aware scheduling strategy (LOSS) that overlaps computation and transmission and avoids costly all-reduce steps for local operators. Key contributions include the classification of local vs global operators, the LOP framework to ensure correctness, and LOSS to optimally split work between robot and GPU server via differential evolution, with plans precomputed for different bandwidths and switchable at runtime. Empirical results on indoor/outdoor robotic setups show up to 41.1% reduction in inference time and up to 35.3% reduction in energy per inference, demonstrating robust, scalable, and easy-to-integrate improvements for real-world robotic inference workloads.

Abstract

The rapid advancements in machine learning techniques have led to significant achievements in various real-world robotic tasks. These tasks rely heavily on fast and energy-efficient inference of deep neural network (DNN) models when deployed on robots. To enhance inference performance, distributed inference has emerged as a promising approach, parallelizing inference across multiple powerful GPU devices in modern data centers using techniques such as data parallelism, tensor parallelism, and pipeline parallelism. However, when deployed on real-world robots, existing parallel methods fail to provide low inference latency and to meet the energy requirements because of the limited bandwidth of robotic IoT. We present Hybrid-Parallel, a high-performance distributed inference system optimized for robotic IoT. Hybrid-Parallel employs a fine-grained approach to parallelize inference at the granularity of local operators within DNN layers (i.e., operators that can be computed independently with only partial input, such as the convolution kernel in the convolution layer). By doing so, Hybrid-Parallel enables different operators of different layers to be computed and transmitted concurrently, and overlaps the computation and transmission phases within the same inference task. The evaluation demonstrates that Hybrid-Parallel reduces inference time by 14.9% to 41.1% and energy consumption per inference by up to 35.3% compared to the state-of-the-art baselines.
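
To make the notion of a local operator concrete, the following is a minimal sketch, not the authors' implementation. It assumes a naive single-channel "valid" convolution on nested lists, and the names `conv2d_valid` and `split_conv2d` are hypothetical. It shows that each output row depends only on a small window of input rows, so two devices can each compute a slice of the output from a partial input and the results can simply be concatenated, with no reduction step.

```python
def conv2d_valid(image, kernel):
    """Naive 'valid' 2D cross-correlation on nested lists (single channel)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def split_conv2d(image, kernel, split_row):
    """Compute output rows [0, split_row) on one device and the rest on another.

    Hypothetical helper: output row r only reads input rows r .. r + kh - 1,
    so each device needs just its own slice of the input plus a small overlap.
    Assumes 1 <= split_row <= len(image) - len(kernel), so both parts are non-empty.
    """
    kh = len(kernel)
    robot_input = image[: split_row + kh - 1]   # partial input for the "robot"
    server_input = image[split_row:]            # partial input for the "server"
    robot_out = conv2d_valid(robot_input, kernel)
    server_out = conv2d_valid(server_input, kernel)
    # Concatenating the two partial outputs is enough; no all-reduce is needed.
    return robot_out + server_out


# Illustrative data only: a 5x5 ramp image and a 3x3 edge-like kernel.
image = [[r * 5 + c for c in range(5)] for r in range(5)]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]

full = conv2d_valid(image, kernel)
parts = split_conv2d(image, kernel, split_row=1)
print(parts == full)
```

In this toy setting the split point plays the role of a scheduling decision: moving `split_row` changes how much of the layer each side computes without changing the result.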

Paper Structure

This paper contains 21 sections, 3 equations, 9 figures, 7 tables, 2 algorithms.

Figures (9)

  • Figure 1: Existing distributed inference approaches on VGG19 [simonyan2015deep] in our experiments, which adopt the pipeline parallelism (PP) paradigm with various layer partitioning strategies. The X-axis represents the layer partitioning strategy, where 'layer i' indicates that all layers up to and including the $i$-th layer are computed on the robot, while the subsequent layers are processed on the GPU server.
  • Figure 2: The instability of wireless transmission between our robot and a base station in robotic IoT networks.
  • Figure 3: Workflow of Hybrid-Parallel. Each local operator layer has to complete the calculation of three local operators, and the same local operator in the three cases has the same computation time on robots and GPU servers, as well as the same transmission time. The output tensor volume of layer 2 is larger than that of layer 1, resulting in longer transmission times for local operators in layer 2, so PP selects a layer partition strategy at layer 1 [liang2023dnn].
  • Figure 4: Architecture of Hybrid-Parallel. The core components of Hybrid-Parallel are highlighted in purple. Hybrid-Parallel adopts the same scheduling scheme as in Fig. \ref{fig:overview}.
  • Figure 5: An example of applying Hybrid-Parallel to a VGG19 [simonyan2015deep] model, where "192.168.50.1" is the IP address of the GPU server.
  • ...and 4 more figures
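
The layer partitioning described in the Figure 1 caption can be illustrated with a simple latency model. This is a hedged sketch under stated assumptions rather than the paper's cost model: per-layer times and tensor sizes are made-up numbers, the functions `partition_latency` and `best_split` are hypothetical, phases run strictly one after another (no overlap), and the cost of returning the result to the robot is ignored.

```python
def partition_latency(robot_ms, server_ms, output_kb, input_kb,
                      bandwidth_kb_per_ms, split):
    """Estimated latency when the first `split` layers run on the robot.

    split = 0 sends the raw input to the server; split = len(robot_ms)
    keeps the whole model on the robot and transmits nothing.
    """
    local = sum(robot_ms[:split])      # layers computed on the robot
    remote = sum(server_ms[split:])    # layers computed on the GPU server
    if split == len(robot_ms):
        transfer = 0.0
    else:
        payload = input_kb if split == 0 else output_kb[split - 1]
        transfer = payload / bandwidth_kb_per_ms
    return local + transfer + remote


def best_split(robot_ms, server_ms, output_kb, input_kb, bandwidth_kb_per_ms):
    """Return the split index with the lowest estimated latency."""
    return min(
        range(len(robot_ms) + 1),
        key=lambda s: partition_latency(robot_ms, server_ms, output_kb,
                                        input_kb, bandwidth_kb_per_ms, s),
    )


# Hypothetical three-layer model (all values are illustrative assumptions).
robot_ms = [4.0, 6.0, 30.0]        # per-layer compute time on the robot
server_ms = [1.0, 1.0, 2.0]        # per-layer compute time on the server
output_kb = [800.0, 200.0, 10.0]   # output tensor size of each layer
input_kb = 600.0                   # size of the raw model input

for bandwidth in (10.0, 100.0):    # KB per millisecond
    split = best_split(robot_ms, server_ms, output_kb, input_kb, bandwidth)
    print(bandwidth, split)
```

Under these assumed numbers the preferred split moves toward the server as bandwidth grows, which is why a plan chosen for one bandwidth may be a poor choice for another.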