
Raising the Performance of the Tinker-HP Molecular Modeling Package [Article v1.0]

Luc-Henri Jolly, Alejandro Duran, Louis Lagardère, Jay W. Ponder, Pengyu Ren, Jean-Philip Piquemal

TL;DR

The paper addresses the acceleration of large-scale molecular dynamics with polarizable force fields on Intel AVX-512-capable CPUs, achieved by rewriting critical routines and reorganizing data. It presents a pedagogical strategy for vectorizing Tinker-HP that covers memory-layout restructuring, loop design, and the use of Intel MKL for vectorized math functions, and it reports substantial performance gains. Across the AMOEBA and CHARMM benchmarks, the vectorized code yields speedups of roughly 1.4–2.0× on a single core and about 1.45–1.59× in parallel for AMOEBA, and about 1.24–1.40× for CHARMM. Scalability remains strong up to thousands of cores, beyond which MPI and memory contention dominate. The work demonstrates productive co-design between chemistry and HPC, provides a reusable framework for vectorizing complex MD codes on current and future architectures, and outlines a path toward further improvements in release 1.2 and beyond.
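The memory-layout restructuring mentioned above can be illustrated by the classic move from an array of structures (AoS) to a structure of arrays (SoA). The following is a minimal C sketch, assuming hypothetical type and function names (`atom_aos`, `atoms_soa`, `dist2_aos`, `dist2_soa`). It is not taken from the Tinker-HP source. It computes the same squared distances from both layouts. Only the SoA loop reads each coordinate with unit stride, which is the access pattern that lets a compiler fill 512-bit registers efficiently.

```c
#include <assert.h>
#include <stddef.h>

/* Array of structures (AoS): the x, y, z of one atom are adjacent, so a loop
   that reads x for many atoms strides over memory (hypothetical layout). */
typedef struct { double x, y, z; } atom_aos;

/* Structure of arrays (SoA): each coordinate lives in its own contiguous
   array, so consecutive loop iterations load consecutive doubles. */
typedef struct { double *x, *y, *z; } atoms_soa;

/* Squared distances from atom i to atoms j = 0..n-1, AoS version. */
void dist2_aos(const atom_aos *a, size_t i, size_t n, double *r2)
{
    assert(i < n);
    for (size_t j = 0; j < n; ++j) {
        double dx = a[j].x - a[i].x;
        double dy = a[j].y - a[i].y;
        double dz = a[j].z - a[i].z;
        r2[j] = dx * dx + dy * dy + dz * dz;
    }
}

/* Same computation on SoA data: unit-stride loads and no dependency between
   iterations, so the loop is a straightforward candidate for SIMD execution. */
void dist2_soa(const atoms_soa *a, size_t i, size_t n, double *restrict r2)
{
    assert(i < n);
    const double xi = a->x[i], yi = a->y[i], zi = a->z[i];
    #pragma omp simd
    for (size_t j = 0; j < n; ++j) {
        double dx = a->x[j] - xi;
        double dy = a->y[j] - yi;
        double dz = a->z[j] - zi;
        r2[j] = dx * dx + dy * dy + dz * dz;
    }
}
```

The `#pragma omp simd` hint only takes effect when the code is compiled with OpenMP SIMD support enabled (for example `-fopenmp-simd` with GCC or Clang). Otherwise the compiler ignores it and the loop still runs correctly.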

Abstract

This living paper reviews the current High Performance Computing (HPC) capabilities of the Tinker-HP molecular modeling package. We focus on the reference double-precision, massively parallel molecular dynamics engine of Tinker-HP, which is dedicated to large-scale simulations, and show how it can be adapted to recent Intel Central Processing Unit (CPU) petascale architectures. First, we discuss the Intel Advanced Vector Extensions 512 (Intel AVX-512) instruction set available in recent Intel processors (e.g., the Intel Xeon Scalable processors and the 2nd generation Intel Xeon Phi processors), which enables wider vectorization. These instructions are the main source of potential computational gains on the latest processors, which justifies substantial vectorization efforts by developers. We then briefly review the organization of the Tinker-HP code, identify the computational hotspots that require Intel AVX-512 optimization, and propose a general and optimal strategy to vectorize those parts of the code. We present this optimization strategy in a pedagogical way so that it can benefit other researchers and students interested in improving the performance of their own software. Finally, we present the performance gains obtained over the unoptimized code, both sequentially and at the parallel scaling limit, for a classical non-polarizable force field (CHARMM) and a polarizable force field (AMOEBA). This paper is continuously updated between versions as new data accumulate in the associated GitHub repository.
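A 512-bit AVX-512 register holds 8 doubles (512 / 64). The usual obstacle to filling those lanes in a pairwise hotspot is a cutoff test inside the arithmetic loop. One common remedy, assumed here for illustration and not presented as the paper's exact code, splits the loop in two. A short first pass keeps only the neighbors inside the cutoff and gathers them into contiguous scratch arrays. A branch-free second pass then does the heavy arithmetic. The sketch below uses a Lennard-Jones pair term with σ = ε = 1 as a stand-in kernel. The names `filter_within_cutoff` and `lj_sum` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Pass 1: keep neighbors with r2 < cut2. Their squared distances go into
   r2_in and their original indices into idx, so forces could later be
   scattered back. The branch lives in this short loop only. */
size_t filter_within_cutoff(size_t n, const double *r2, double cut2,
                            double *r2_in, size_t *idx)
{
    assert(cut2 > 0.0);
    size_t m = 0;
    for (size_t j = 0; j < n; ++j) {
        if (r2[j] < cut2) {
            r2_in[m] = r2[j];
            idx[m] = j;
            ++m;
        }
    }
    return m;
}

/* Pass 2: branch-free kernel over the compacted array (Lennard-Jones with
   sigma = epsilon = 1, a hypothetical stand-in for a real force-field term).
   The reduction clause lets the compiler keep partial sums in vector lanes. */
double lj_sum(size_t m, const double *r2_in)
{
    double e = 0.0;
    #pragma omp simd reduction(+:e)
    for (size_t j = 0; j < m; ++j) {
        double ir2 = 1.0 / r2_in[j];
        double ir6 = ir2 * ir2 * ir2;
        e += 4.0 * (ir6 * ir6 - ir6);
    }
    return e;
}
```

The design trade-off is one extra pass and some scratch memory in exchange for a kernel loop with no data-dependent control flow. That kernel is usually where most of the floating-point work, and therefore most of the vectorization benefit, lies.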

Paper Structure

This paper contains 43 sections, 2 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Memory layout of a running process. Arrows indicate the direction in which each zone grows.
  • Figure 2: Schematic picture of three data layouts in memory. Double vertical separators mark 64-bit boundaries; single ones mark 32-bit boundaries.
  • Figure 3: Performance gain for the STMV using Rel or Vec. The boost factor decreases from 1.59 to 1.57 as the number of cores increases.
  • Figure 4: Performance gain for the ribosome using Rel or Vec. The boost factor decreases from 1.51 to 1.49 as the number of cores increases.
  • Figure 5: Performance gain with the CHARMM force field for ubiquitin using Rel or Vec. The boost factor remains constant as the number of cores increases.
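Figure 2 contrasts data layouts relative to 32-bit and 64-bit boundaries. The underlying arithmetic can be sketched in a few lines of C. For example, a 64-bit double placed right after a 32-bit integer would start at byte 4 and straddle a 64-bit boundary, and rounding its offset up gives byte 8. The same rounding is used to allocate arrays on 64-byte boundaries, which matches the width of one AVX-512 register. This is a minimal sketch: `align_up`, `is_aligned`, and `alloc_aligned_doubles` are hypothetical helpers, not functions from Tinker-HP.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Round offset up to the next multiple of align (align must be a power of 2). */
size_t align_up(size_t offset, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (offset + align - 1) & ~(align - 1);
}

/* True if pointer p lies on an align-byte boundary. */
int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p % align) == 0;
}

/* Allocate n doubles on a 64-byte boundary (one AVX-512 register width) so
   that aligned vector loads can be used. C11 aligned_alloc requires the size
   to be a multiple of the alignment, hence the rounding. Release with free(). */
double *alloc_aligned_doubles(size_t n)
{
    size_t bytes = align_up(n * sizeof(double), 64);
    return aligned_alloc(64, bytes);
}
```

With this helper, `align_up(4, 8)` evaluates to 8, which is where a double following a 32-bit integer lands once it is aligned to a 64-bit boundary.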