Towards Performance-Aware Allocation for Accelerated Machine Learning on GPU-SSD Systems

Ayush Gundawar, Euijun Chung, Hyesoon Kim

TL;DR

The paper tackles data-intensive ML workloads whose datasets exceed GPU DRAM capacity, so that CPU-mediated data movement over PCIe becomes a dominant source of latency. It introduces MQMS, an in-storage GPU architecture and simulator that is aware of internal SSD state. MQMS uses dynamic address allocation to exploit internal SSD parallelism and fine-grained address mapping to avoid read-modify-write overhead on small I/O requests, with throughput scaling as $O(\min(n,p))$ across planes. The authors generate compact GPU traces with Allegro-based kernel sampling and evaluate MQMS against an MQSim-MacSim baseline on enterprise SSD configurations, covering both LLM and classical ML workloads. They report orders-of-magnitude improvements in IOPS, device response time, and simulation end time. The work argues that high-performance, data-centric GPU-SSD systems are feasible and offers insights into scheduling and allocation policies for enterprise storage pipelines.

Abstract

The exponential growth of data-intensive machine learning workloads has exposed significant limitations in conventional GPU-accelerated systems, especially when processing datasets exceeding GPU DRAM capacity. We propose MQMS, an augmented in-storage GPU architecture and simulator that is aware of internal SSD states and operations, enabling intelligent scheduling and address allocation to overcome performance bottlenecks caused by CPU-mediated data access patterns. MQMS introduces dynamic address allocation to maximize internal parallelism and fine-grained address mapping to efficiently handle small I/O requests without incurring read-modify-write overheads. Through extensive evaluations on workloads ranging from large language model inference to classical machine learning algorithms, MQMS demonstrates orders-of-magnitude improvements in I/O request throughput, device response time, and simulation end time compared to existing simulators.
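To make the parallelism argument concrete, here is a minimal sketch of why dynamic address allocation can help. It is not the MQMS implementation. It assumes that $n$ in the TL;DR's $O(\min(n,p))$ denotes the number of outstanding requests and $p$ the number of planes, that a static policy pins each address to plane `address % planes`, and that a dynamic policy may place a request on any idle plane. The function names are hypothetical.

```python
from collections import Counter
from math import ceil


def static_rounds(addresses, planes):
    """Service rounds when each address is pinned to plane = address % planes.

    Requests that collide on one plane are serialized, so the busiest
    plane determines the total number of rounds.
    """
    load = Counter(addr % planes for addr in addresses)
    return max(load.values()) if load else 0


def dynamic_rounds(addresses, planes):
    """Service rounds when every request may be placed on any idle plane
    (an idealized stand-in for dynamic allocation)."""
    return ceil(len(addresses) / planes)


def busy_planes(n, planes):
    """Planes kept busy in one round: bounded by both the number of
    requests n and the number of planes, i.e. min(n, p)."""
    return min(n, planes)


if __name__ == "__main__":
    strided = [0, 8, 16, 24]  # a stride equal to the plane count
    print("static rounds: ", static_rounds(strided, planes=8))
    print("dynamic rounds:", dynamic_rounds(strided, planes=8))
    print("busy planes:   ", busy_planes(len(strided), 8))
```

Under these assumptions, a strided pattern that collides on a single plane is serialized by the static policy but spread across planes by the dynamic one, which is the effect that $\min(n,p)$ summarizes.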

Paper Structure

This paper contains 9 sections, 9 figures, and 1 table.

Figures (9)

  • Figure 1: Dynamic Address Allocation
  • Figure 2: Coarse-Grained Mapping
  • Figure 3: Fine-Grained Mapping
  • Figure 4: IOPS by Workload
  • Figure 5: Device Response Time by Workload
  • ...and 4 more figures