Analysis of the Performance of the Matrix Multiplication Algorithm on the Cirrus Supercomputer

Temitayo Adefemi

TL;DR

This study analyzes the performance of serial and parallel matrix multiplication on the Cirrus supercomputer, evaluating how thread- and process-based parallelization, cache tiling, and compiler optimizations influence throughput. It employs a direct C implementation on row-major data, uses the Box-Muller transform to generate normally distributed matrices, and compares GCC and Intel compilers across multiple matrix sizes. Core findings show that naïve multiplication scales roughly as $O(n^3)$, with significant speedups from OpenMP threading and substantial, albeit compiler-dependent, improvements from -O3 optimizations. The results inform practical optimization strategies for HPC matrix workloads and point toward hybrid tiled approaches for future large-scale performance gains.

Abstract

Matrix multiplication is integral to various scientific and engineering disciplines, including machine learning, image processing, and gaming. With the increasing data volumes in areas like machine learning, the demand for efficient parallel processing of large matrices has grown significantly. This study explores the performance of both serial and parallel matrix multiplication on the Cirrus supercomputer at the University of Edinburgh. The results demonstrate the scalability and efficiency of these methods, providing insights for optimizing matrix multiplication in real-world applications.

Paper Structure (17 sections, 2 equations, 8 figures, 2 tables)

Introduction
Literature Review
Methodology for Parallelising Matrix Multiplication
Column Prefetching
Cache Tiling
Parallelisation Paradigms
Threads
Processes
Experimental Dataset
Experimental Environment
Results & Analysis
Cache Efficiency
Pipelining
Compiler Optimizations
CPU
...and 2 more sections

Figures (8)

Figure 1: Matrix Multiplication
Figure 2: Single vs Multi-threaded programs
Figure 3: Illustration of Processes
Figure 4: Communication Needed for Different Algorithms Relative to Tesseract
Figure 5: Pseudocode for the Box-Muller transform in C.
...and 3 more figures
