Towards Performance-Aware Allocation for Accelerated Machine Learning on GPU-SSD Systems
Ayush Gundawar, Euijun Chung, Hyesoon Kim
TL;DR
The paper tackles data-intensive ML workloads whose datasets exceed GPU DRAM capacity, so that CPU-mediated data movement over PCIe becomes a dominant source of latency. It introduces MQMS, an SSD-aware in-storage GPU architecture and simulator that uses dynamic address allocation and fine-grained address mapping to exploit internal SSD parallelism and reduce read-modify-write overhead, with throughput scaling as $O(\min(n,p))$ across planes. Using Allegro-based kernel sampling to generate compact GPU traces, the study evaluates MQMS against an MQSim-MacSim baseline on enterprise SSD configurations across LLM and classical ML workloads and reports improvements of up to orders of magnitude in IOPS, device response time, and simulation end time. The work demonstrates the feasibility of high-performance, data-centric GPU-SSD systems and offers insights into scheduling and allocation policies for enterprise storage pipelines.
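To make the $O(\min(n,p))$ scaling concrete, the following is a minimal sketch rather than the MQMS implementation. It assumes that $n$ is the number of outstanding requests and $p$ the number of planes (the summary does not define them), that each plane services one request per time step, and that a static scheme maps a logical page address to plane `lpa % planes`. The function names are hypothetical.

```python
from collections import Counter


def rounds_static(lpas, planes):
    """Time steps needed when each address is pinned to plane lpa % planes (assumed static scheme)."""
    load = Counter(lpa % planes for lpa in lpas)
    return max(load.values(), default=0)


def rounds_dynamic(lpas, planes):
    """Time steps needed when each request goes to the currently least-loaded plane."""
    load = [0] * planes
    for _ in lpas:
        idx = load.index(min(load))  # pick the least-loaded plane
        load[idx] += 1
    return max(load)


def throughput(n_requests, rounds):
    """Requests completed per time step under the one-request-per-plane-per-step assumption."""
    return n_requests / rounds if rounds else 0.0


# Eight requests whose addresses all collide on plane 0 under static mapping.
lpas = [0, 8, 16, 24, 32, 40, 48, 56]
planes = 8
static_tp = throughput(len(lpas), rounds_static(lpas, planes))
dynamic_tp = throughput(len(lpas), rounds_dynamic(lpas, planes))
```

In this toy model the static scheme serializes all eight requests on one plane, while the dynamic scheme spreads them so that per-step throughput equals $\min(n,p)$. Real devices add channel, chip, and die contention that this sketch ignores.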
Abstract
The exponential growth of data-intensive machine learning workloads has exposed significant limitations in conventional GPU-accelerated systems, especially when processing datasets exceeding GPU DRAM capacity. We propose MQMS, an augmented in-storage GPU architecture and simulator that is aware of internal SSD states and operations, enabling intelligent scheduling and address allocation to overcome performance bottlenecks caused by CPU-mediated data access patterns. MQMS introduces dynamic address allocation to maximize internal parallelism and fine-grained address mapping to efficiently handle small I/O requests without incurring read-modify-write overheads. Through extensive evaluations on workloads ranging from large language model inference to classical machine learning algorithms, MQMS demonstrates orders-of-magnitude improvements in I/O request throughput, device response time, and simulation end time compared to existing simulators.
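The read-modify-write overhead mentioned above can be illustrated with a small cost model. This is a sketch under stated assumptions, not the accounting used by MQMS: a write that only partially covers a mapping unit is assumed to read the whole old unit, merge the new data, and write the whole unit back. The 16 KiB and 4 KiB unit sizes and the function name are illustrative choices, not values taken from the paper.

```python
def flash_bytes_for_write(offset, length, unit):
    """Return (bytes_read, bytes_written) for a write of `length` bytes at `offset`
    under a mapping granularity of `unit` bytes (simplified model, length > 0)."""
    first = offset // unit
    last = (offset + length - 1) // unit
    bytes_read = 0
    bytes_written = 0
    for u in range(first, last + 1):
        start = max(offset, u * unit)
        end = min(offset + length, (u + 1) * unit)
        if end - start < unit:
            # Partial coverage: the old unit is read so it can be merged (read-modify-write).
            bytes_read += unit
        bytes_written += unit
    return bytes_read, bytes_written


# A 4 KiB write under coarse (16 KiB) versus fine-grained (4 KiB) mapping.
coarse = flash_bytes_for_write(offset=4096, length=4096, unit=16384)
fine = flash_bytes_for_write(offset=4096, length=4096, unit=4096)
```

Under these assumptions the coarse mapping reads and rewrites a full 16 KiB unit to service a 4 KiB request, whereas the fine-grained mapping writes only the 4 KiB that changed and performs no read.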
