GDEV-AI: A Generalized Evaluation of Deep Learning Inference Scaling and Architectural Saturation

Kathiravan Palaniappan

TL;DR

The paper benchmarks CPU-only CNN inference on a legacy Xeon and a Granite Rapids platform to map throughput, latency, and saturation points, and introduces GDEV-AI, a reproducible benchmarking framework. It reports a large generational gap: Granite Rapids delivers up to ~32x higher ResNet-50 throughput than the legacy system. Batching yields substantial gains up to about B=8, while oversubscription degrades performance. A Roofline-style interpretation identifies memory bandwidth, cache size, and AMX-enabled compute as the core determinants of scalability, which informs capacity planning in heterogeneous data centers. The work offers actionable insights for deploying CPU-based inference in resource-constrained environments and establishes a baseline for future CPU-vs-GPU comparisons.
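
For reference, the standard Roofline bound behind such an interpretation can be written as below. The symbols are the conventional ones and are an assumption here, not necessarily the paper's own notation: $P_{\text{peak}}$ is peak compute throughput, $B_{\text{mem}}$ is sustained memory bandwidth, and $I$ is arithmetic intensity.

```latex
P_{\text{attainable}} = \min\left(P_{\text{peak}},\; I \cdot B_{\text{mem}}\right),
\qquad
I = \frac{\text{FLOPs performed}}{\text{bytes moved to and from memory}}
```

Under this bound, a workload with low arithmetic intensity is capped by memory bandwidth, while a high-intensity workload is capped by peak compute, which is the ceiling that matrix extensions such as AMX would raise.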

Abstract

Deep learning inference is increasingly deployed in production environments, where throughput, latency, and hardware efficiency are critical. Although specialized accelerators are increasingly adopted, many inference workloads still run on CPU-only systems, particularly in legacy data centers and cost-sensitive environments. This study investigates the scalability limits of CPU-based inference for convolutional neural networks by benchmarking ResNet models across varying batch sizes on two hardware tiers: a legacy Intel Xeon E5-2403 v2 processor and a modern Intel Xeon 6 "Granite Rapids" platform. Results show that legacy CPUs quickly reach throughput saturation, with limited scaling beyond small batch sizes due to instruction-level and memory constraints. In contrast, the Granite Rapids system leverages Intel Advanced Matrix Extensions (AMX) to achieve substantially higher throughput. However, oversubscription beyond the physical core count introduces execution contention and tail-latency amplification, revealing a performance degradation regime in modern architectures. We introduce GDEV-AI, a reproducible benchmarking framework for analyzing scalability behavior and architectural saturation in CPU-based inference. By establishing a vendor-neutral baseline, this work provides empirical insight into performance bottlenecks and informs capacity planning in heterogeneous data center environments.

Paper Structure (27 sections, 9 equations, 7 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Inference Serving and Tail Latency
Benchmarking and Datacenter ML Inference Workloads
CPU/GPU Performance Modeling and Architectural Limits
Deployment Context and Implications
Operational Resource Constraints
Widespread Adoption of AI Inference and Capacity Implications
Experimental Setup
Experimental Methodology
Benchmarking Methodology
Hardware Platform
Software Environment
Models and Inference Configuration
CPU Inference Throughput Scaling
...and 12 more sections

Figures (7)

Figure 1: Evolution of Intel Xeon server processors across generations, showcasing representative SKUs over time. This figure places the legacy Xeon E5-2403 v2 platform, evaluated in this study, in context with a modern Granite Rapids-class Xeon system. Both are experimentally assessed to quantify generational performance scaling and architectural efficiency.
Figure 2: Benchmark execution structure. For each model (ResNet50, ResNet18), we run a sweep over batch sizes and thread counts, repeating each configuration (R1--R3) to capture run-to-run variability.
Figure 3: Peak-to-peak CPU Inference Throughput Comparison: Evaluation of ResNet architectures across legacy (4-thread) and modern (24-thread) environments. To ensure execution determinism and filter system jitter, all trend lines report the median images per second (IPS). The modern server is pinned to 24 physical cores to avoid hyperthreading-induced performance cliffs, representing a comparison of optimized system capacities across hardware generations.
Figure 4: CPU inference latency characteristics comparing the legacy Xeon and modern Granite Rapids platforms. Top row: latency as a function of batch size under constrained parallelism, with thread counts limited to $1$--$4$ on both platforms to maintain hardware parity. Bottom row: tail latency analysis (median vs. P99) conducted at a fixed $4$-thread configuration, highlighting production reliability under identical execution resources.
Figure 5: Comparative inference performance of ResNet-18 and ResNet-50 across CPU generations. The top row shows results for the legacy Intel Xeon E5-2403 v2 evaluated at its maximum capacity of 4 physical cores, while the bottom row presents results for the modern Intel Xeon Granite Rapids evaluated at its maximum capacity of 24 physical cores. Columns, from left to right, report (a,d) median inference latency as a function of batch size, (b,e) median versus P99 tail latency at $B=1$, and (c,f) multi-thread speedup relative to single-thread execution. Latency values are reported as medians across independent sweeps to mitigate the influence of outliers. Speedup results extend beyond the physical core count to include logical cores, exposing the effects of oversubscription and shared-resource contention. Overall, the figure highlights the contrasting scalability limits and architectural behavior of legacy and modern CPU platforms under full utilization.
...and 2 more figures
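
The sweep structure described in the captions (models × batch sizes × thread counts, repeated R1--R3, reporting median latency, P99, and images per second) can be sketched as follows. This is a minimal illustration under stated assumptions, not the GDEV-AI implementation: `run_sweep`, `percentile`, and the `run_batch` callable are hypothetical names, the nearest-rank percentile is one of several common definitions, and the thread count is simply passed to the callable, whereas a real harness would configure the inference framework's threading and add warm-up iterations.

```python
import math
import statistics
import time
from itertools import product


def percentile(samples, q):
    """Nearest-rank percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def run_sweep(run_batch, models, batch_sizes, thread_counts,
              repetitions=3, iters=20):
    """Time run_batch(model, batch, threads) over the full configuration grid.

    run_batch is a hypothetical callable that executes one inference batch.
    Each configuration is repeated `repetitions` times (R1..R3 by default),
    and each repetition records `iters` latency samples.
    """
    results = []
    for model, batch, threads in product(models, batch_sizes, thread_counts):
        for rep in range(1, repetitions + 1):
            latencies = []
            for _ in range(iters):
                start = time.perf_counter()
                run_batch(model, batch, threads)
                latencies.append(time.perf_counter() - start)
            median_s = statistics.median(latencies)
            results.append({
                "model": model,
                "batch": batch,
                "threads": threads,
                "rep": rep,
                "median_ms": median_s * 1e3,
                "p99_ms": percentile(latencies, 99) * 1e3,
                # Images per second derived from the median batch latency.
                "ips": batch / median_s if median_s > 0 else float("inf"),
            })
    return results
```

Reporting the median rather than the mean keeps a single slow iteration from skewing throughput, while the separate P99 column keeps that tail behavior visible.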
