Table of Contents
Fetching ...

Accelerating Fortran Codes: A Method for Integrating Coarray Fortran with CUDA Fortran and OpenMP

James McKevitt, Eduard I. Vorobyov, Igor Kulikov

TL;DR

This research focuses on innovating and refining a parallel programming methodology that fuses the strengths of Intel Coarray Fortran, Nvidia CUDA Fortran, and OpenMP for distributed memory parallelism, high-speed GPU acceleration and shared memory parallelism respectively.

Abstract

Fortran's prominence in scientific computing requires strategies to ensure both that legacy codes are efficient on high-performance computing systems, and that the language remains attractive for the development of new high-performance codes. Coarray Fortran (CAF), part of the Fortran 2008 standard introduced for parallel programming, facilitates distributed memory parallelism with a syntax familiar to Fortran programmers, simplifying the transition from single-processor to multi-processor coding. This research focuses on innovating and refining a parallel programming methodology that fuses the strengths of Intel Coarray Fortran, Nvidia CUDA Fortran, and OpenMP for distributed memory parallelism, high-speed GPU acceleration and shared memory parallelism respectively. We consider the management of pageable and pinned memory, CPU-GPU affinity in NUMA multiprocessors, and robust compiler interfacing with speed optimisation. We demonstrate our method through its application to a parallelised Poisson solver and compare the methodology, implementation, and scaling performance to that of the Message Passing Interface (MPI), finding CAF offers similar speeds with easier implementation. For new codes, this approach offers a faster route to optimised parallel computing. For legacy codes, it eases the transition to parallel computing, allowing their transformation into scalable, high-performance computing applications without the need for extensive re-design or additional syntax.

Accelerating Fortran Codes: A Method for Integrating Coarray Fortran with CUDA Fortran and OpenMP

TL;DR

This research focuses on innovating and refining a parallel programming methodology that fuses the strengths of Intel Coarray Fortran, Nvidia CUDA Fortran, and OpenMP for distributed memory parallelism, high-speed GPU acceleration and shared memory parallelism respectively.

Abstract

Fortran's prominence in scientific computing requires strategies to ensure both that legacy codes are efficient on high-performance computing systems, and that the language remains attractive for the development of new high-performance codes. Coarray Fortran (CAF), part of the Fortran 2008 standard introduced for parallel programming, facilitates distributed memory parallelism with a syntax familiar to Fortran programmers, simplifying the transition from single-processor to multi-processor coding. This research focuses on innovating and refining a parallel programming methodology that fuses the strengths of Intel Coarray Fortran, Nvidia CUDA Fortran, and OpenMP for distributed memory parallelism, high-speed GPU acceleration and shared memory parallelism respectively. We consider the management of pageable and pinned memory, CPU-GPU affinity in NUMA multiprocessors, and robust compiler interfacing with speed optimisation. We demonstrate our method through its application to a parallelised Poisson solver and compare the methodology, implementation, and scaling performance to that of the Message Passing Interface (MPI), finding CAF offers similar speeds with easier implementation. For new codes, this approach offers a faster route to optimised parallel computing. For legacy codes, it eases the transition to parallel computing, allowing their transformation into scalable, high-performance computing applications without the need for extensive re-design or additional syntax.
Paper Structure (26 sections, 13 figures, 1 table)

This paper contains 26 sections, 13 figures, 1 table.

Figures (13)

  • Figure 1: The relative transfer speeds of moving variables between various parts of the host memory and the device memory, where green represents a fast to negligible speed, and yellow represents a slower speed. As CUDA Fortran is used for GPU operations, only Nvidia code can be used to define device variables. MPI is added for comparison.
  • Figure 2: Configuration of physical memory and the procedure required for retrieval across the partition allowing pinned memory for fast device transfers but slow partition crossing by copying memory. In this case, a variable has been changed by the Intel execution stream before this procedure starts, and so the Nvidia execution stream is required to retrieve an updated copy of this array before operating on it, transferring it to the device and performing further GPU-accelerated operations. The Intel execution stream then takes over, retrieving the value(s) of the updated variable, performing its operations, and then allowing for transfer to other host memory spaces in the distributed memory. This host-host transfer must occur every time a device operation is performed with a coarrayed variable.
  • Figure 3: Configuration of physical memory and the procedure required for retrieval across the partition using pointers allowing for no transfer overhead in host-host transfers, but requiring the use of pageable memory for device transfers and incurring an overhead. In this case, the linking is performed only once during the first stages of the running of the code, meaning both host execution streams are always accessing the same version of host variables.
  • Figure 4: Configuration of physical memory and the procedure required for retrieval across the partition using pointers allowing for no transfer overhead in host-host transfers, and allowing for the use of pinned memory for device transfers, incurring the minimal overhead. As in the previous case, the linking is performed only once during the first stages of the running of the code, meaning both host execution streams are always accessing the same version of host variables.
  • Figure 5: Architectural schematic detailing the interconnect between GPUs and NUMA nodes within a dual-socket configuration.
  • ...and 8 more figures