Enabling Population-Level Parallelism in Tree-Based Genetic Programming for GPU Acceleration

Zhihong Wu, Lishuang Wang, Kebin Sun, Zhuozhao Li, Ran Cheng

TL;DR

EvoGP tackles the scalability bottlenecks of Tree-based Genetic Programming on GPUs by introducing a tensorized, fixed-shape encoding for heterogeneous trees and a unified subtree-exchange kernel that enables on-device genetic operations. It employs an adaptive parallelism strategy that blends intra- and inter-individual execution, switching based on dataset size to sustain high GPU utilization. Integrated as PyTorch custom operators, EvoGP delivers seamless Python interoperability and supports multi-output tasks via Modi nodes, extending TGP to domains like reinforcement learning and robotics. Empirical results show EvoGP achieving over $10^{11}$ GPops/s throughput with substantial speedups against existing GPU and CPU libraries while maintaining accuracy and scalability for large populations. The framework’s open-source release and broad problem support position EvoGP as a practical, high-performance tool for interpretable evolutionary AI on modern accelerators.

Abstract

Tree-based Genetic Programming (TGP) is a widely used evolutionary algorithm for tasks such as symbolic regression, classification, and robotic control. Due to the intensive computational demands of running TGP, GPU acceleration is crucial for achieving scalable performance. However, efficient GPU-based execution of TGP remains challenging, primarily due to three core issues: (1) the structural heterogeneity of program individuals, (2) the complexity of integrating multiple levels of parallelism, and (3) the incompatibility between high-performance CUDA execution and flexible Python-based environments. To address these issues, we propose EvoGP, a high-performance framework tailored for GPU acceleration of TGP via population-level parallel execution. First, EvoGP introduces a tensorized representation that encodes variable-sized trees into fixed-shape, memory-aligned arrays, enabling uniform memory access and parallel computation across diverse individuals. Second, EvoGP adopts an adaptive parallelism strategy that dynamically combines intra- and inter-individual parallelism based on dataset size, ensuring high GPU utilization across a broad spectrum of tasks. Third, EvoGP embeds custom CUDA kernels into the PyTorch runtime, achieving seamless integration with Python-based environments such as Gym, MuJoCo, Brax, and Genesis. Experiments show that EvoGP attains a peak throughput exceeding $10^{11}$ GPops/s, with speedups of up to $528\times$ over GPU-based TGP implementations and $18\times$ over the fastest CPU-based libraries, while maintaining comparable accuracy and improved scalability across large population sizes. EvoGP is open source and accessible at: https://github.com/EMI-Group/evogp.
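The tensorized representation mentioned in the abstract can be illustrated with a minimal pure-Python sketch. It follows the Figure 2 caption in using three arrays per tree (types, values, subtree sizes) and NaN padding up to a shared maximum length. Everything else is an assumption made for illustration and is not taken from EvoGP: the pre-order node layout, the numeric type codes, the function ids, the nested-tuple input format, and the name `encode`.

```python
import math

# Hypothetical type codes and function ids, chosen only for this sketch.
VAR, CONST, FUNC = 0.0, 1.0, 2.0
FUNC_IDS = {"add": 0.0, "div": 1.0, "exp": 2.0, "neg": 3.0}
NAN = float("nan")


def encode(tree, max_len):
    """Encode a nested-tuple tree as three NaN-padded lists in pre-order (assumed layout)."""
    types, values, sizes = [], [], []

    def visit(node):
        idx = len(types)  # pre-order position of this node
        types.append(NAN)
        values.append(NAN)
        sizes.append(NAN)
        size = 1
        if node[0] == "func":
            _, name, children = node
            types[idx], values[idx] = FUNC, FUNC_IDS[name]
            for child in children:
                size += visit(child)
        elif node[0] == "const":
            types[idx], values[idx] = CONST, float(node[1])
        else:  # variable node: the value slot stores the input index
            types[idx], values[idx] = VAR, float(node[1])
        sizes[idx] = float(size)  # number of nodes in the subtree rooted here
        return size

    visit(tree)
    if len(types) > max_len:
        raise ValueError("tree exceeds max_len")
    pad = [NAN] * (max_len - len(types))
    return types + pad, values + pad, sizes + pad


# sigmoid(x0) = 1 / (1 + exp(-x0)), the function shown in Figure 1
SIGMOID = ("func", "div", [
    ("const", 1.0),
    ("func", "add", [
        ("const", 1.0),
        ("func", "exp", [("func", "neg", [("var", 0)])]),
    ]),
])
X0 = ("var", 0)

# Trees of different sizes become rows of identical length.
population = [encode(t, max_len=8) for t in (SIGMOID, X0)]
# Subtree sizes of SIGMOID: [7, 1, 5, 1, 3, 2, 1, nan]
```

Because every individual occupies the same number of slots, a whole population can be stacked into three batched arrays, which is the property that allows uniform memory access across individuals.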

Paper Structure

This paper contains 22 sections, 17 equations, 12 figures, 15 tables, 1 algorithm.

Figures (12)

  • Figure 1: An example of a tree. In TGP, a computational process is represented as a tree structure. The tree shown in the figure illustrates the sigmoid function. In the figure, the orange nodes represent function nodes, the red nodes represent constant nodes, and the green nodes represent variable nodes.
  • Figure 2: Illustration of the tree encoding process. In tensorized encoding, a tree is encoded into three tensors: types, values, and subtree sizes. We use NaN padding to ensure that these tensors reach a uniform maximum size, allowing trees of different structures to be encoded into tensors of the same shape. This enables the encoding of an entire population of trees into three batched tensors.
  • Figure 3: An illustration of genetic operations in TGP. The upper part depicts the structural modifications of trees. The red box highlights the crossover operation, where two parent trees exchange subtrees to generate a new tree. The green box illustrates the mutation process, where a subtree of a given tree is replaced with a newly generated subtree. The lower part of the figure demonstrates the corresponding transformations in the tensor representation of trees for both crossover and mutation operations.
  • Figure 4: Illustration of CUDA-based parallelism in EvoGP. Left: Population parallelism for the generation, crossover, mutation, and inference kernels, where each individual is assigned to a separate thread. Right: SR fitness evaluation, where hybrid parallelism (upper) computes all trees in one kernel launch, while data parallelism (lower) processes each tree separately across multiple launches. The system adaptively switches between the two modes based on dataset size.
  • Figure 5: Illustration of the EvoGP architecture. EvoGP consists of two main components: Algorithm and Problem. The Algorithm module implements the TGP algorithm along with multiple operation variants. The Problem module integrates various benchmark problems for user evaluation. Within these components, we design CUDA kernels to enable parallel acceleration of computational processes. These CUDA kernels are seamlessly integrated into Python using PyTorch's custom operator functionality.
  • ...and 7 more figures
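
The tensor-level view of crossover and mutation in the Figure 3 caption can be sketched as a single splice over the encoded arrays. This is a minimal pure-Python illustration, not EvoGP's CUDA kernel. It assumes a pre-order layout in which a subtree rooted at position `i` occupies the contiguous slots `i .. i + size - 1`. The function name `exchange_subtree`, the numeric codes in the example, and the policy of returning the recipient unchanged when the result would exceed the maximum length are all hypothetical.

```python
import math

NAN = float("nan")


def exchange_subtree(recipient, donor, i, j, max_len):
    """Replace the recipient's subtree at i with the donor's subtree at j.

    Each tree is a (types, values, sizes) triple of NaN-padded lists in
    pre-order (assumed layout).
    """
    r_sizes, d_sizes = recipient[2], donor[2]
    n = int(r_sizes[0])    # real node count is the root's subtree size
    old = int(r_sizes[i])  # size of the subtree being removed
    new = int(d_sizes[j])  # size of the subtree being inserted
    delta = new - old
    if n + delta > max_len:
        return recipient   # assumed policy: reject oversized offspring

    def splice(r, d):
        body = r[:i] + d[j:j + new] + r[i + old:n]
        return body + [NAN] * (max_len - len(body))

    types, values, sizes = (splice(r, d) for r, d in zip(recipient, donor))
    # Ancestors of position i are the earlier nodes whose subtree spans i;
    # their sizes change by the difference between the two subtrees.
    for k in range(i):
        if k + int(r_sizes[k]) > i:
            sizes[k] = r_sizes[k] + delta
    return types, values, sizes


# Recipient: div(1, add(1, exp(neg(x0)))); donor: add(x0, 2).
# Type codes (0 = variable, 1 = constant, 2 = function) and function ids
# (0 = add, 1 = div, 2 = exp, 3 = neg) are hypothetical.
recipient = (
    [2.0, 1.0, 2.0, 1.0, 2.0, 2.0, 0.0, NAN],
    [1.0, 1.0, 0.0, 1.0, 2.0, 3.0, 0.0, NAN],
    [7.0, 1.0, 5.0, 1.0, 3.0, 2.0, 1.0, NAN],
)
donor = (
    [2.0, 0.0, 1.0] + [NAN] * 5,
    [0.0, 0.0, 2.0] + [NAN] * 5,
    [3.0, 1.0, 1.0] + [NAN] * 5,
)

# Swap neg(x0) (position 5) for the whole donor tree (position 0),
# giving div(1, add(1, exp(add(x0, 2)))).
child = exchange_subtree(recipient, donor, i=5, j=0, max_len=8)
```

Under these assumptions, mutation is the same call with a freshly generated subtree as the donor, which is one way to read the single subtree-exchange kernel mentioned in the TL;DR.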