Beyond Inference: Performance Analysis of DNN Server Overheads for Computer Vision

Ahmed F. AbouElhamayed, Susanne Balle, Deshanand Singh, Mohamed S. Abdelfattah

TL;DR

This work analyzes the end-to-end performance of vision DNN serving and shows that preprocessing, data movement, and queuing frequently dominate latency and throughput, even when inference itself is highly optimized. By evaluating diverse CV tasks, hardware configurations, and inter-DNN broker setups, the study demonstrates that holistic optimization, beyond accelerator improvements alone, is essential. Key findings include preprocessing overheads accounting for up to 56% of latency for medium-sized images and ~97% for large images, queuing contributing a large share of latency under high concurrency, and substantial throughput gains (up to 2.25×) when adopting optimized serving and in-memory brokers such as Redis. The results advocate for system-wide design choices, including GPU-accelerated preprocessing, dynamic batching, and efficient inter-DNN communication, to improve end-to-end performance in real-world deployments.

Abstract

Deep neural network (DNN) inference has become an important part of many data-center workloads. This has prompted focused efforts to design ever-faster deep learning accelerators such as GPUs and TPUs. However, an end-to-end DNN-based vision application contains more than just DNN inference, including input decompression, resizing, sampling, normalization, and data transfer. In this paper, we perform a thorough evaluation of computer vision inference requests performed on a throughput-optimized serving system. We quantify the performance impact of server overheads such as data movement, preprocessing, and message brokers between two DNNs producing outputs at different rates. Our empirical analysis encompasses many computer vision tasks including image classification, segmentation, detection, depth-estimation, and more complex processing pipelines with multiple DNNs. Our results consistently demonstrate that end-to-end application performance can easily be dominated by data processing and data movement functions (up to 56% of end-to-end latency in a medium-sized image, and ~80% impact on system throughput in a large image), even though these functions have been conventionally overlooked in deep learning system design. Our work identifies important performance bottlenecks in different application scenarios, achieves 2.25× better throughput compared to prior work, and paves the way for more holistic deep learning system design.
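
To make the "share of end-to-end latency" metric concrete, the sketch below times each stage of a toy request pipeline and reports the fraction of total latency spent in each one. It is only an illustration under stated assumptions: the stage functions (`decode`, `resize`, `normalize`, `infer`) are hypothetical pure-Python stand-ins, not the paper's code or any serving framework's API, and the helper names `profile_stages` and `latency_shares` are invented here.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[Any], Any]]


def profile_stages(stages: List[Stage], payload: Any) -> Tuple[Any, Dict[str, float]]:
    """Run the stages in order; return the final output and per-stage wall-clock seconds."""
    timings: Dict[str, float] = {}
    for name, fn in stages:
        start = time.perf_counter()
        payload = fn(payload)
        timings[name] = time.perf_counter() - start
    return payload, timings


def latency_shares(timings: Dict[str, float]) -> Dict[str, float]:
    """Convert absolute stage times into fractions of the total latency."""
    total = sum(timings.values())
    if total == 0:
        return {name: 0.0 for name in timings}
    return {name: t / total for name, t in timings.items()}


# Hypothetical stand-ins for the real stages (assumptions, not the paper's code).
def decode(raw: bytes) -> List[int]:
    return list(raw)  # stands in for input decompression


def resize(pixels: List[int]) -> List[int]:
    return pixels[::4]  # stands in for resizing: keep every 4th sample


def normalize(pixels: List[int]) -> List[float]:
    return [p / 255.0 for p in pixels]  # scale values to [0, 1]


def infer(features: List[float]) -> float:
    return sum(features) / len(features)  # stands in for DNN inference


STAGES: List[Stage] = [
    ("decode", decode),
    ("resize", resize),
    ("normalize", normalize),
    ("inference", infer),
]

if __name__ == "__main__":
    raw_image = bytes(range(256)) * 16  # synthetic 4 KiB "image"
    _, stage_times = profile_stages(STAGES, raw_image)
    for stage, share in latency_shares(stage_times).items():
        print(f"{stage:>10}: {share:6.1%} of end-to-end latency")
```

In a real deployment the stand-ins would be replaced by actual decoding, resizing, normalization, host-to-device transfer, and model execution, and queuing time would need to be measured at the server rather than inside a single request.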

Paper Structure (16 sections, 12 figures)

This paper contains 16 sections, 12 figures.

Figures (12)

  • Figure 1: Sample API system serving a DNN on a GPU.
  • Figure 2: A sample DNN application consisting of preprocessing and inference with annotated server parameters.
  • Figure 3: Evaluation of throughput across diverse system setups running the same Vision Transformer (ViT) model.
  • Figure 4: Throughput and inference time percentage for various HuggingFace models for both CPU/GPU preprocessing.
  • Figure 5: Throughput, average latency, and queuing time of a throughput-optimized inference server at different concurrencies.
  • ...and 7 more figures