Cost-effective Deep Learning Infrastructure with NVIDIA GPU

Aatiz Ghimire, Shahnawaz Alam, Siman Giri, Madhav Prasad Ghimire

TL;DR

The paper addresses the high cost and privacy concerns of cloud GPUs in resource-constrained settings by designing a low-cost, Beowulf-style cluster of four GTX 1650 GPUs built on Rocky Linux, Slurm, and open-source tools. It documents the hardware and software stack, security hardening, centralized identity management, and MPI-based distributed computing, and it validates single-node CUDA workloads and Slurm-based task scheduling, although the GPUs do not cluster across nodes. The findings point to cost savings, control over data, and operational independence, while identifying inter-node GPU clustering as a limitation and proposing future upgrades such as faster interconnects and HPC-class GPUs. Overall, the work offers a practical blueprint for deploying accessible, in-house DL/HPC infrastructure in developing regions.

Abstract

The growing demand for computational power is driven by advancements in deep learning, the increasing need for big data processing, and the requirements of scientific simulations for academic and research purposes. Developing countries like Nepal often lack the resources to invest in new and better hardware for these purposes. However, optimizing and building on existing technology can still meet these computing demands effectively. To address these needs, we built a cluster using four NVIDIA GeForce GTX 1650 GPUs. The cluster consists of four nodes: one master node that controls and manages the entire cluster, and three compute nodes dedicated to processing tasks. The master node is equipped with the software needed for package management, resource scheduling, and deployment, such as Anaconda and Slurm. In addition, a Network File System (NFS) was integrated to provide the additional storage required by the cluster. Because the cluster is accessible via SSH through a public domain address, which poses significant cybersecurity risks, we implemented fail2ban to mitigate brute-force attacks and enhance security. Despite the continuous challenges encountered during design and implementation, this project demonstrates how powerful computational clusters can be built to handle resource-intensive tasks in various demanding fields.

Paper Structure

This paper contains 7 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Cluster Network Configuration
  • Figure 2: Cluster Login Screen via SSH
  • Figure 3: Slurm Command sinfo for listing cluster
  • Figure 4: Deep Learning Cluster Installation Flowchart
  • Figure 5: Failed Login attempts in cluster
  • ...and 1 more figure