Affordable HPC: Leveraging Small Clusters for Big Data and Graph Computing

Ruilong Wu; Yisu Wang; Dirk Kutscher

TL;DR

This study explores strategies for academic researchers to optimize computational resources within limited budgets, focusing on building small, efficient computing clusters, and proposes a Graph Neural Network (GNN) framework to analyze and optimize parallelism in computing networks.

Abstract

This study explores strategies for academic researchers to optimize computational resources within limited budgets, focusing on building small, efficient computing clusters. It delves into the comparative costs of purchasing versus renting servers, guided by market research and economic theories on tiered pricing. The paper offers detailed insights into the selection and assembly of hardware components such as CPUs, GPUs, and motherboards tailored to specific research needs. It introduces innovative methods to mitigate the performance issues caused by PCIe switch bandwidth limitations in order to enhance GPU task scheduling. Furthermore, a Graph Neural Network (GNN) framework is proposed to analyze and optimize parallelism in computing networks.

Paper Structure

This paper contains 45 sections, 4 equations, 7 figures, and 3 tables.

Introduction
Technical Detail Compilation
Performance Optimization
GNN for Network and Neural network parallelism
Motivation
Survey on different cloud server rental prices
Theory of Tiered Pricing
CPU Selection
CPU internal structure
AMD EPYC
Intel Xeon
Influence of Topology
Chiplet Placement
Other parameters
Power
...and 30 more sections

Figures (7)

Figure 1: (a) AMD EPYC 9004 configuration with 12 Core Complex Dies (CCDs) surrounding a central I/O Die (IOD) [b4] (b) Processor floorplan diagram for the 2-die XCC configuration [b5] (c) Standard RDMA over PCIe transfer process: ① Generate Work Queue Element ② Issue Doorbell ③ Network Card Fetches Task ④ DMA Data to Network Card ⑤ Data Encapsulation and Transmission ⑥ Processing at Receiving End ⑦ Return Completion Message ⑧ Generate Completion Queue Element ⑨ Application Polls CQE
Figure 2: Comparing 4-GPU topologies with NVLink and PCIe. In 4-GPU-NVLink, GPU0 and GPU1 have 40 GB/s peak bandwidth between them, as do GPU2 and GPU3. The other peer-to-peer connections have 20 GB/s peak bandwidth [b12].
Figure 3: (a) Average Time to Transfer 10 GB of Data between GPUs (b) Average Time to Transfer 10 GB of Data (c) Mean RTT Between DPU and Traditional Method
Figure 4: Our design
Figure 5: (a) Socket Direct (b) GDR (GPUDirect Remote Direct Memory Access)
...and 2 more figures
