Analysis of the Performance of the Matrix Multiplication Algorithm on the Cirrus Supercomputer

Temitayo Adefemi

TL;DR

This study analyzes the performance of serial and parallel matrix multiplication on the Cirrus supercomputer, evaluating how thread- and process-based parallelization, cache tiling, and compiler optimizations influence throughput. It employs a direct C implementation on row-major data, uses the Box-Muller transform to generate normally distributed matrices, and compares GCC and Intel compilers across multiple matrix sizes. Core findings show that naïve multiplication scales roughly as $O(n^3)$, with significant speedups from OpenMP threading and substantial, albeit compiler-dependent, improvements from -O3 optimizations. The results inform practical optimization strategies for HPC matrix workloads and point toward hybrid tiled approaches for future large-scale performance gains.

Abstract

Matrix multiplication is integral to various scientific and engineering disciplines, including machine learning, image processing, and gaming. With data volumes increasing in areas like machine learning, the demand for efficient parallel processing of large matrices has grown significantly. This study explores the performance of both serial and parallel matrix multiplication on the Cirrus supercomputer at the University of Edinburgh. The results demonstrate the scalability and efficiency of these methods, providing insights for optimizing matrix multiplication in real-world applications.

Paper Structure

This paper contains 17 sections, 2 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Matrix Multiplication
  • Figure 2: Single vs Multi-threaded programs
  • Figure 3: Illustration of Processes
  • Figure 4: Communication Needed for Different Algorithms Relative to Tesseract
  • Figure 5: Pseudocode for the Box-Muller transform in C.
  • ...and 3 more figures