Lectures on Parallel Computing
Jesper Larsson Träff
TL;DR
This collection of lecture notes systematically builds a foundation for parallel computing, moving from core theoretical models and classifications (PRAM, BSP, Flynn's taxonomy) and performance metrics (time, speed-up, efficiency, and scalability) to practical patterns for structuring parallel work (DAGs, pipelines, stencils, domain decomposition, and reductions). It then covers shared-memory programming with pthreads and OpenMP, detailing models, synchronization primitives, and patterns for loops, tasks, reductions, and caching effects, followed by a thorough treatment of distributed-memory programming with MPI, including communicators, point-to-point and collective communication, one-sided operations, and datatype mechanisms. The notes emphasize cost-, work-, and iso-efficiency-based analyses, and repeatedly connect theoretical results (e.g., Amdahl's Law, Brent's Theorem, Master Theorem) to practical algorithm design, benchmarking, and performance considerations across memory hierarchies. A central theme is the design of cost- and work-optimal parallel algorithms, with concrete examples such as merging, prefix sums, and matrix-matrix multiplication, plus extensive exercises to reinforce both conceptual understanding and implementation skills. The work serves as a comprehensive, rigorous reference for HPC education, providing both the methodological toolkit and the practical guidance necessary to develop and benchmark scalable parallel software on modern multi-core and multi-node systems.
Abstract
These lecture notes are designed to accompany an imaginary, virtual, undergraduate, one- or two-semester course on the fundamentals of Parallel Computing, as well as to serve as background and reference for graduate courses on High-Performance Computing, parallel algorithms, and shared-memory multiprocessor programming. They introduce theoretical concepts and tools for expressing, analyzing, and judging parallel algorithms and cover in detail the two most widely used concrete frameworks, OpenMP and MPI, as well as the threading interface pthreads, for writing parallel programs for either shared or distributed memory parallel computers, with emphasis on general concepts and principles. Code examples are given in a C-like style, and many are actual, correct C code. The lecture notes deliberately do not cover GPU architectures and GPU programming, but the general concerns, guidelines, and principles (time, work, cost, efficiency, scalability, memory structure and bandwidth) will be just as relevant for efficiently utilizing various GPU architectures. Likewise, the lecture notes focus on deterministic algorithms only and do not use randomization. The student of this material will find it instructive to take the time to understand concepts and algorithms visually. The exercises can be used for self-study and as inspiration for small implementation projects in OpenMP and MPI that can and should accompany any serious course on Parallel Computing. The student will benefit from actually implementing and carefully benchmarking the suggested algorithms on the parallel computing system that may or should be made available as part of such a Parallel Computing course. In class, the exercises can be used as a basis for hand-ins and small programming projects, for which sufficient additional detail and precision should be provided by the instructor.
