Quality at the Tail of Machine Learning Inference

Zhengxin Yang; Wanling Gao; Chunjie Luo; Lei Wang; Fei Tang; Xu Wen; Jianfeng Zhan

TL;DR

The paper identifies a counterintuitive phenomenon: deep learning inference quality can fluctuate with inference time, especially under tight tail-latency constraints, leading to potentially catastrophic outcomes in safety-critical tasks. It formalizes tail quality as the inference quality under a tail-time threshold and proposes a two-part evaluation framework that jointly models inference-time distributions and the resulting tail-quality metric. The framework estimates per-instance time distributions via a KDE-based Monte Carlo approach and uses a convergence measure based on Jensen-Shannon divergence to predict tail-quality statistics, achieving rJSD values below 0.05 and substantial reductions in required inferences compared to MLPerf. Experimental instantiations across four systems, two frameworks, three models, and three datasets demonstrate accurate time-to-quality mappings and the ability to predict worst-case tail quality with lower computational effort, enabling proactive, risk-aware benchmarking for real-time AI deployments.
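As a rough illustration of the idea of quality under a tail-time threshold, the sketch below scores a model under one simple assumption that is not confirmed by the text above: an inference counts toward quality only if it is both correct and finished within the threshold. The function name `tail_quality`, the synthetic log-normal latencies, and the 80% base accuracy are all hypothetical; only the use of a tail-latency percentile as the threshold comes from the summary.

```python
import random
import statistics


def tail_quality(correct, times, threshold):
    """Fraction of instances that are correct AND finished within `threshold`.

    Assumption (not from the paper): a late answer is scored as a miss.
    """
    hits = [c and t <= threshold for c, t in zip(correct, times)]
    return sum(hits) / len(hits)


random.seed(0)
n_instances, n_rounds = 500, 30

# Hypothetical per-instance correctness: fixed across rounds, as the model
# output for a given input does not change between repeated runs.
correct = [random.random() < 0.8 for _ in range(n_instances)]

# Hypothetical latencies in milliseconds: one list per round of inference.
rounds = [
    [random.lognormvariate(3.0, 0.3) for _ in range(n_instances)]
    for _ in range(n_rounds)
]

# Use the 95th percentile of all observed latencies as the time threshold.
all_times = [t for r in rounds for t in r]
threshold = statistics.quantiles(all_times, n=100)[94]

per_round = [tail_quality(correct, r, threshold) for r in rounds]
plain_accuracy = sum(correct) / n_instances

print(f"accuracy ignoring time: {plain_accuracy:.3f}")
print(f"tail quality over rounds: {min(per_round):.3f} to {max(per_round):.3f}")
```

Even though correctness is identical in every round, the score varies from round to round because different instances miss the deadline each time, which is the kind of fluctuation the summary describes.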

Abstract

Machine learning inference should be subject to stringent inference time constraints while ensuring high inference quality, especially in safety-critical (e.g., autonomous driving) and mission-critical (e.g., emotion recognition) contexts. Neglecting either aspect can lead to severe consequences, such as loss of life and property damage. Many studies do not consider both metrics together, leading to incomplete or misleading evaluations. The study reveals a counterintuitive phenomenon: deep learning inference quality fluctuates with inference time. To describe this phenomenon, the authors coin the term "tail quality," which provides a more comprehensive evaluation and overcomes the limitations of conventional metrics. The research also proposes an initial evaluation framework that analyzes the factors affecting quality fluctuations and supports prediction of the potential distribution of inference quality. The effectiveness of the framework is validated through experiments on deep learning models for three different tasks across four systems.

Paper Structure (12 sections, 5 equations, 4 figures, 4 tables, 2 algorithms)

Introduction
Tail Quality and Evaluation Framework
Definition of Tail Quality
Establishment of Evaluation Framework
Experimental Analysis and Results
Instantiation of the Evaluation Framework
Validation of Effectiveness
Analysis of Influencing Components
Related Work
Optimization of Inference Quality and Time
Benchmarking of Deep Learning Inference
Conclusion

Figures (4)

Figure 1: Quality fluctuations of an image classification model, Vision Transformer (ViT), on Tesla P100. From left to right, the triangular symbols mark the maximum and minimum inference quality when the 90th, 95th, and 99th percentiles of tail latency are used as the inference time thresholds.
Figure 2: The relationship between the size of dataset instances (COCO and MMLU) and the corresponding average inference time, established using linear regression. Sub-figures (a) and (b) present experimental results for the DETR model on servers A, B, and C. Sub-figure (c) illustrates the experimental results for the Vicuna model implemented in the PyTorch framework on servers A and D.
Figure 3: Fitting of probability density functions for inference times of instances with varying sizes, as well as for instances of the same size on different systems. The histograms represent the frequencies of instance inference times and correspond to the left $y$ axis. The fitted probability density curves correspond to the right $y$ axis.
Figure 4: Heatmap of the Jensen-Shannon Divergence (JSD) between the probability density distributions of inference times fitted for each instance. Both the $x$ and $y$ axes are sorted in ascending order of image size. The heatmap displays only 100 instances with distinct sizes from the dataset.
