GROMACS Unplugged: How Power Capping and Frequency Shapes Performance on GPUs

Ayesha Afzal, Anna Kahler, Georg Hager, Gerhard Wellein

TL;DR

This study analyzes how GPU clock frequency and power capping influence GROMACS performance on four NVIDIA GPUs (A40, A100, L4, L40) using six biomolecular benchmarks and two synthetic workloads (Pi Solver and BabelStream). By combining frequency-tuning experiments with power-cap analyses, the authors map throughput in MD workloads (ns/day) across architectures, revealing that small systems are highly frequency-sensitive while large systems become memory bandwidth-limited; high-end GPUs like the A100 maintain near-peak performance under reasonable power caps. The Pi Solver demonstrates ideal compute scalability with frequency, whereas BabelStream highlights memory-bound limits, providing context for interpreting MD performance. The results offer practical guidance for hardware selection and tuning in large-scale MD workflows under power constraints and establish a reproducible benchmarking framework for energy-aware MD computing.

Abstract

Molecular dynamics simulations are essential tools in computational biophysics, but their performance depends heavily on hardware choices and configuration. In this work, we present a comprehensive performance analysis of four NVIDIA GPU accelerators -- A40, A100, L4, and L40 -- using six representative GROMACS biomolecular workloads alongside two synthetic benchmarks: Pi Solver (compute bound) and STREAM Triad (memory bound). We investigate how performance scales with GPU graphics clock frequency and how workloads respond to power capping. The two synthetic benchmarks define the extremes of frequency scaling: Pi Solver shows ideal compute scalability, while STREAM Triad reveals memory bandwidth limits -- framing GROMACS's performance in context. Our results reveal distinct frequency scaling behaviors: smaller GROMACS systems exhibit strong frequency sensitivity, while larger systems saturate quickly, becoming increasingly memory bound. Under power capping, performance remains stable until architecture- and workload-specific thresholds are reached, with high-end GPUs like the A100 maintaining near-maximum performance even under reduced power budgets. Our findings provide practical guidance for selecting GPU hardware and optimizing GROMACS performance for large-scale MD workflows under power constraints.
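
The two scaling extremes named in the abstract can be illustrated with a toy saturation model. This is a minimal sketch and not the authors' analysis: the function names, the linear-then-flat shape, and all numbers are assumptions chosen purely for illustration. Throughput grows linearly with graphics clock until it reaches a frequency-independent ceiling. An infinite ceiling gives the ideal compute-bound behavior, and a low ceiling gives early saturation of the memory-bound kind.

```python
import math


def modeled_throughput(freq_mhz: float, perf_per_mhz: float, ceiling: float) -> float:
    """Toy model (assumption): linear in clock frequency, capped by a fixed ceiling."""
    if freq_mhz < 0:
        raise ValueError("frequency must be non-negative")
    return min(perf_per_mhz * freq_mhz, ceiling)


def saturation_frequency(perf_per_mhz: float, ceiling: float) -> float:
    """Clock frequency above which the toy model stops improving."""
    return ceiling / perf_per_mhz


if __name__ == "__main__":
    # Hypothetical parameters in arbitrary performance units, not measured data.
    slope = 0.5
    for freq in (400, 800, 1200, 1600):
        compute_like = modeled_throughput(freq, slope, math.inf)
        memory_like = modeled_throughput(freq, slope, 400.0)
        print(f"{freq:5d} MHz  compute-like={compute_like:7.1f}  memory-like={memory_like:7.1f}")
```

In this picture, a workload's position between the two curves is set by where its saturation frequency falls relative to the clock range of the GPU.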

Paper Structure

This paper contains 25 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Average performance (ns/day) as a function of GPU graphics clock frequency, measured at the maximum memory frequency setting, for various accelerators across six GROMACS benchmarks.
  • Figure 2: Comparison of average throughput (ns/day) across six biomolecular systems at maximum GPU memory frequency, i.e., 7.251 GHz (A40), 1.215 GHz (A100), 6.251 GHz (L4) and 9.001 GHz (L40).
  • Figure 3: Average performance as a function of graphics frequency for various accelerators at the maximum memory frequency setting, across the Pi Solver and STREAM Triad benchmarks.
  • Figure 4: Average performance (ns/day) as a function of GPU power cap, measured at the maximum memory frequency setting of each accelerator, for various accelerators across six GROMACS benchmarks.
  • Figure 5: Average performance as a function of power cap for various accelerators at the maximum memory frequency setting, across the Pi Solver and STREAM Triad benchmarks.