Enabling Population-Level Parallelism in Tree-Based Genetic Programming for GPU Acceleration
Zhihong Wu, Lishuang Wang, Kebin Sun, Zhuozhao Li, Ran Cheng
TL;DR
EvoGP tackles the scalability bottlenecks of Tree-based Genetic Programming (TGP) on GPUs by introducing a tensorized, fixed-shape encoding for heterogeneous trees and a unified subtree-exchange kernel that enables on-device genetic operations. It employs an adaptive parallelism strategy that combines intra- and inter-individual execution, switching between them based on dataset size to sustain high GPU utilization. Integrated as PyTorch custom operators, EvoGP interoperates directly with Python and supports multi-output tasks via Modi nodes, extending TGP to domains such as reinforcement learning and robotics. Experiments show a peak throughput exceeding $10^{11}$ GPops/s, with speedups of up to $528\times$ over GPU-based TGP implementations and $18\times$ over the fastest CPU-based libraries, while maintaining comparable accuracy and scaling to large populations. The framework’s open-source release and broad problem support position EvoGP as a practical, high-performance tool for interpretable evolutionary AI on modern accelerators.
Abstract
Tree-based Genetic Programming (TGP) is a widely used evolutionary algorithm for tasks such as symbolic regression, classification, and robotic control. Due to the intensive computational demands of running TGP, GPU acceleration is crucial for achieving scalable performance. However, efficient GPU-based execution of TGP remains challenging, primarily due to three core issues: (1) the structural heterogeneity of program individuals, (2) the complexity of integrating multiple levels of parallelism, and (3) the incompatibility between high-performance CUDA execution and flexible Python-based environments. To address these issues, we propose EvoGP, a high-performance framework tailored for GPU acceleration of TGP via population-level parallel execution. First, EvoGP introduces a tensorized representation that encodes variable-sized trees into fixed-shape, memory-aligned arrays, enabling uniform memory access and parallel computation across diverse individuals. Second, EvoGP adopts an adaptive parallelism strategy that dynamically combines intra- and inter-individual parallelism based on dataset size, ensuring high GPU utilization across a broad spectrum of tasks. Third, EvoGP embeds custom CUDA kernels into the PyTorch runtime, achieving seamless integration with Python-based environments such as Gym, MuJoCo, Brax, and Genesis. Experiments show that EvoGP attains a peak throughput exceeding $10^{11}$ GPops/s, with speedups of up to $528\times$ over GPU-based TGP implementations and $18\times$ over the fastest CPU-based libraries, while maintaining comparable accuracy and improved scalability across large population sizes. EvoGP is open source and accessible at: https://github.com/EMI-Group/evogp.
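To make the idea of a tensorized representation concrete, the following is a minimal sketch in plain Python, not EvoGP's actual data layout or API. It assumes a prefix-order (pre-order) encoding in which each tree occupies three fixed-length rows (node type, node value, subtree size) padded to a hypothetical `MAX_LEN`. All names here (`encode`, `evaluate`, `MAX_LEN`, the type and operator codes) are illustrative assumptions.

```python
# Illustrative sketch only: a fixed-shape, prefix-order encoding of expression
# trees. Names and layout are assumptions, not EvoGP's actual implementation.

MAX_LEN = 8                       # hypothetical fixed row length per tree
CONST, VAR, FUNC, PAD = 0, 1, 2, 3
ADD, MUL = 0, 1
OPS = {"add": ADD, "mul": MUL}    # binary operators only, for brevity


def encode(tree):
    """Encode a nested-tuple tree into three rows of length MAX_LEN.

    Tree forms: ("c", value), ("x", index), ("add", l, r), ("mul", l, r).
    """
    types, values, sizes = [], [], []

    def visit(node):
        idx = len(types)
        tag = node[0]
        if tag == "c":
            types.append(CONST); values.append(float(node[1])); sizes.append(1)
        elif tag == "x":
            types.append(VAR); values.append(float(node[1])); sizes.append(1)
        else:
            types.append(FUNC); values.append(float(OPS[tag])); sizes.append(0)
            for child in node[1:]:
                visit(child)
            sizes[idx] = len(types) - idx  # nodes in the subtree rooted here

    visit(tree)
    if len(types) > MAX_LEN:
        raise ValueError("tree exceeds MAX_LEN")
    pad = MAX_LEN - len(types)
    return types + [PAD] * pad, values + [0.0] * pad, sizes + [0] * pad


def evaluate(types, values, sizes, x):
    """Evaluate one encoded tree by scanning its prefix array right to left."""
    stack = []
    for i in range(sizes[0] - 1, -1, -1):   # sizes[0] is the tree length
        if types[i] == CONST:
            stack.append(values[i])
        elif types[i] == VAR:
            stack.append(x[int(values[i])])
        else:
            left, right = stack.pop(), stack.pop()
            stack.append(left + right if int(values[i]) == ADD else left * right)
    return stack[0]


# Two trees of different sizes stored in rows of identical shape.
population = [
    encode(("add", ("x", 0), ("mul", ("c", 2.0), ("x", 1)))),  # x0 + 2 * x1
    encode(("mul", ("x", 0), ("x", 1))),                       # x0 * x1
]
results = [evaluate(*row, x=[3.0, 4.0]) for row in population]
```

Because every row has the same length, a population can be stacked into a dense two-dimensional array, and the per-tree loop in `evaluate` is the kind of work that could be assigned to one GPU thread per individual. Storing subtree sizes alongside the nodes means a subtree is a contiguous slice of the row, which is what makes array-level subtree exchange straightforward in an encoding of this kind.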
