Improving a Parallel C++ Intel AVX-512 SIMD Linear Genetic Programming Interpreter

William B. Langdon

TL;DR

The paper demonstrates Magpie's ability to automatically uncover small, correct AVX-512 SIMD optimizations for GPengine's parallel interpreter, translating manual AVX work into efficient, XML-driven mutations. It documents a workflow that uses Intel Intrinsics Guide-wrapped intrinsics, XML-based edits, and a Linux mprotect sandbox to safely evaluate mutations. Results show substantial speedups over the SSE baseline, achieving up to ~3.9× faster performance and 3.5 Giga GP/s, while highlighting practical challenges in compilation, equivalence detection, and test coverage. The work emphasizes reproducible hardware-aware optimization and discusses limitations and directions for broader transplantation and AVX exploration.

Abstract

We extend recent 256-bit SSE vector work to 512-bit AVX, giving a fourfold speedup. We use MAGPIE (Machine Automated General Performance Improvement via Evolution of software) to speed up a C++ linear genetic programming interpreter. Local search is provided with three alternative hand optimised codes, revision history and the Intel 512 bit AVX512VL documentation as C++ XML. Magpie is applied to the new Single Instruction Multiple Data (SIMD) parallel interpreter for Peter Nordin's linear genetic programming GPengine. Linux mprotect provides the sandbox, whilst performance is measured by perf instruction count. In both cases, in a matter of hours local search reliably sped up 114 or 310 lines of manually written parallel SIMD code for the Intel Advanced Vector Extensions (AVX) by 2 percent.

Paper Structure

This paper contains 25 sections, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Four GPengine test programs each with four instructions
  • Figure 2: Distribution of 256 pairs of fitness input values. Notice concentration at edge cases 0, 1 and 255 (log vertical scale).
  • Figure 3: Sum of distribution of 256 output values across four test programs during fitness testing. Starting at the input (instruction 0, protected division, see Figure 2), instructions 1 and 2 and values output by the 4 programs (log vertical scale).
  • Figure 4: Entropy of distributions of 64 values for each of the four test programs (1 black, 2 purple, 3 green and 4 light blue). Starting at the inputs (protected division, Figure 2), instructions 1 and 2 and values output by the 4 programs (Figure 3). Note that, due to int wrap around, addition and subtraction (+ -) tend to lose little information compared to multiply and protected division (* /).
  • Figure 5: The read-write I/O registers are surrounded by 4KB buffers where either read or write access will cause an illegal access violation (SegFault). Unused bytes are filled with 90 ($5\times 16 +10$, ASCII 'Z', 4 bits set and 4 bits clear). After running the mutant, the test harness checks that the pattern has not been disturbed. The division lookup table and the program are also given PROT_READ, as well as being similarly surrounded by 4KB guards.
  • ...and 6 more figures
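
The guard-page layout described in the Figure 5 caption can be sketched in C++ roughly as follows. This is a minimal illustration, assuming Linux mmap and mprotect. The names GuardedBuffer, guarded_alloc, pattern_intact, make_read_only and guarded_free are hypothetical and are not taken from GPengine or Magpie. The guard size is assumed to be the system page size (4KB on typical x86-64 Linux).

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical sketch; names are illustrative, not from GPengine or Magpie.
constexpr unsigned char kFill = 90;  // 5*16+10 = 0x5A = 'Z': 4 bits set, 4 clear

struct GuardedBuffer {
    unsigned char* base = nullptr;  // start of the whole mapping (first guard)
    unsigned char* data = nullptr;  // start of the accessible middle region
    std::size_t page = 0;           // guard size, taken from the system page size
    std::size_t data_bytes = 0;     // accessible size, rounded up to whole pages
};

// Map [guard | data | guard]. Guards are PROT_NONE, so any read or write
// that strays into them raises SIGSEGV. The data region is pre-filled.
inline bool guarded_alloc(GuardedBuffer& g, std::size_t bytes) {
    const long ps = sysconf(_SC_PAGESIZE);
    if (ps <= 0) return false;
    g.page = static_cast<std::size_t>(ps);
    g.data_bytes = ((bytes + g.page - 1) / g.page) * g.page;
    if (g.data_bytes == 0) g.data_bytes = g.page;
    const std::size_t total = g.data_bytes + 2 * g.page;
    void* p = mmap(nullptr, total, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { g = GuardedBuffer{}; return false; }
    g.base = static_cast<unsigned char*>(p);
    g.data = g.base + g.page;
    std::memset(g.data, kFill, g.data_bytes);
    if (mprotect(g.base, g.page, PROT_NONE) != 0 ||
        mprotect(g.data + g.data_bytes, g.page, PROT_NONE) != 0) {
        munmap(g.base, total);
        g = GuardedBuffer{};
        return false;
    }
    return true;
}

// After running a mutant: every byte outside [used_off, used_off + used_len)
// must still hold the fill pattern.
inline bool pattern_intact(const GuardedBuffer& g, std::size_t used_off,
                           std::size_t used_len) {
    assert(g.data != nullptr);
    for (std::size_t i = 0; i < g.data_bytes; ++i) {
        const bool in_use = i >= used_off && i < used_off + used_len;
        if (!in_use && g.data[i] != kFill) return false;
    }
    return true;
}

// For read-only inputs (e.g. a lookup table): remove write permission.
inline bool make_read_only(const GuardedBuffer& g) {
    return mprotect(g.data, g.data_bytes, PROT_READ) == 0;
}

inline void guarded_free(GuardedBuffer& g) {
    if (g.base) munmap(g.base, g.data_bytes + 2 * g.page);
    g = GuardedBuffer{};
}
```

A harness along these lines would call guarded_alloc once per buffer, run the mutant, and then call pattern_intact to detect stray writes that stayed inside the accessible pages and so did not trigger a fault.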