Study on the Particle Sorting Performance for Reactor Monte Carlo Neutron Transport on Apple Unified Memory GPUs

Changyuan Liu

TL;DR

The finding is that, for the Apple M2 Max and M3 Max chips, sorting on the CPU leads to better performance per power than sorting on the GPU for the ExaSMR whole core benchmark problems and the HTR-10 high temperature gas reactor fuel pebble problem.

Abstract

In Monte Carlo neutron transport simulations of nuclear reactor physics on GPUs, particle sorting plays a significant role in computational performance. Traditionally, CPUs and GPUs have been separate devices connected by a link with low data transfer rates and high data transfer latency. Emerging computing chips tend to integrate CPUs and GPUs; one example is Apple silicon with unified memory. Such unified memory chips open the door to new strategies of CPU-GPU collaboration for Monte Carlo neutron transport. Sorting particles on the CPU and transporting them on the GPU is one such strategy, which on traditional devices with separate CPUs and GPUs has suffered from high CPU-GPU data transfer latency. The finding is that, for the Apple M2 Max chip, sorting on the CPU leads to better performance per power than sorting on the GPU for the ExaSMR whole core benchmark problems and the HTR-10 high temperature gas reactor fuel pebble problem. The partially sorted particle order has been identified as a contributor to the higher performance of CPU sorting compared with GPU sorting. The in-house code, using both CPU and GPU, achieves 7.5 times the power efficiency of OpenMC on CPU for the ExaSMR whole core benchmark with depleted fuel, and 150 times for the HTR-10 fuel pebble benchmark with depleted fuel.
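As an illustration only, the split between sorting on the CPU and transporting on the GPU can be sketched as follows. The sort key (material index, then energy), the `Particle` fields, and the function names are assumptions made for this sketch and are not taken from the in-house code; `transport_step` is a plain Python stand-in for a GPU kernel. On a unified memory device, the particle and index buffers could be visible to both processors, so the CPU would only need to produce an index permutation.

```python
# Hypothetical sketch: all names and the choice of sort key are
# illustrative assumptions, not the paper's implementation.
from dataclasses import dataclass


@dataclass
class Particle:
    material_id: int  # assumed primary sort key
    energy: float     # assumed secondary sort key (arbitrary units)


def cpu_sort_indices(particles):
    """CPU step: return particle indices ordered by (material_id, energy)."""
    return sorted(
        range(len(particles)),
        key=lambda i: (particles[i].material_id, particles[i].energy),
    )


def transport_step(particles, order):
    """Stand-in for the GPU kernel: visit particles in the sorted order."""
    return [particles[i].material_id for i in order]


particles = [
    Particle(2, 1.0),
    Particle(0, 5.0),
    Particle(2, 0.1),
    Particle(1, 3.0),
]
order = cpu_sort_indices(particles)         # permutation built on the CPU
visited = transport_step(particles, order)  # materials now visited in groups
```

Sorting indices rather than the particle records themselves keeps the sketch close to a shared-buffer design, in which the particle data stay in place and only the visiting order changes.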

Paper Structure (19 sections, 8 figures, 10 tables)

Introduction
Development on Apple Silicon as a Unified Memory Device
Apple M2 Max Chip
Objective-C/Swift Programming Languages and Frameworks
Metal Shading Language & Framework
Apple GPU Programming Patterns
Sorting Algorithms
Summary of Sorting Algorithms on CPU & GPU
Performance of Sorting on Apple Chip
Random Integers
Partially Sorted Integers
Sorting Strategy for Monte Carlo Neutron Transport
Reactor Simulation Benchmarks
Simulation Configuration
Verification: VERA Pincell & Assembly Benchmark Problem
...and 4 more sections

Figures (8)

Figure 1: A snapshot of the design of some recent merged CPU and GPU chips.
Figure 2: A snapshot (left) and a sketch of the design (right) of the Apple M2 Max chip. I-Cache stands for instruction cache, and D-Cache stands for data cache. Avalanche and Blizzard are architecture design code names.
Figure 3: A sketch of application development in the Objective-C & Swift programming languages on Apple devices.
Figure 4: A sketch of CPU-GPU program compilation scheme on Nvidia and Apple GPU devices.
Figure 5: Programming patterns for Apple GPU.
...and 3 more figures
