cuGenOpt: A GPU-Accelerated General-Purpose Metaheuristic Framework for Combinatorial Optimization

Yuyang Liu

Abstract

Combinatorial optimization problems arise in logistics, scheduling, and resource allocation, yet existing approaches face a fundamental trade-off among generality, performance, and usability. We present cuGenOpt, a GPU-accelerated general-purpose metaheuristic framework that addresses all three dimensions simultaneously. At the engine level, cuGenOpt adopts a "one block evolves one solution" CUDA architecture with a unified encoding abstraction (permutation, binary, integer), a two-level adaptive operator selection mechanism, and hardware-aware resource management. At the extensibility level, a user-defined operator registration interface allows domain experts to inject problem-specific CUDA search operators. At the usability level, a JIT compilation pipeline exposes the framework as a pure-Python API, and an LLM-based modeling assistant converts natural-language problem descriptions into executable solver code. Experiments across five thematic suites on three GPU architectures (T4, V100, A800) show that cuGenOpt outperforms general MIP solvers by orders of magnitude, achieves competitive quality against specialized solvers on instances up to n=150, and attains 4.73% gap on TSP-442 within 30s. Twelve problem types spanning five encoding variants are solved to optimality. Framework-level optimizations cumulatively reduce pcb442 gap from 36% to 4.73% and boost VRPTW throughput by 75-81%. Code: https://github.com/L-yang-yang/cugenopt

Paper Structure (76 sections, 4 equations, 9 figures, 23 tables, 2 algorithms)


Figures (9)

  • Figure 1: cuGenOpt system architecture. The framework provides a three-layer abstraction: a user interface (Python API and CUDA problem definition), core components (adaptive search and operator management), and a GPU execution engine (L2-aware memory management and CUDA kernels).
  • Figure 2: GPU parallel execution model. The population is distributed across CUDA blocks (P blocks = population size). Within each block, 128 threads sample candidate moves in parallel and perform a block-level reduction to find the best move; thread 0 then makes the accept/reject decision. Shared memory holds the current solution, the operator weights, and the reduction buffer.
  • Figure 3: Adaptive Operator Selection (AOS) mechanism. Each CUDA block maintains local statistics (usage count and improvement count) for each operator. Every 10 batches, weights are updated via EMA based on observed effectiveness. The two-level structure controls both K-step (number of operators per iteration) and sequence-level (specific operator) selection probabilities.
  • Figure 4: L2 cache-aware population sizing. The framework automatically adjusts population size to fit the working set into L2 cache, preventing cache thrashing. For TSP n=1000 (working set 3.8 MB), V100 (L2=6 MB) selects pop=32, while A800 (L2=40 MB) can support larger populations.
  • Figure 5: JIT compilation pipeline. User-provided CUDA code snippets (objective and constraints) are injected into a template, compiled via nvcc, and executed as a subprocess. SHA-256 hashing enables compilation caching. The entire process is transparent to users, appearing as a standard Python function call.
  • ...and 4 more figures
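
The execution model in Figure 2 can be emulated sequentially to make the control flow concrete. The sketch below is plain Python rather than CUDA, and several parts are assumptions rather than the framework's actual behavior: the swap move, the TSP objective, and the greedy acceptance rule are placeholders, and `block_step` and `tour_length` are hypothetical names. Only the block size of 128 and the sample, reduce, accept/reject structure come from the caption.

```python
import math
import random

THREADS_PER_BLOCK = 128  # from the Figure 2 caption


def tour_length(tour, dist):
    """Length of a closed tour over a dense distance matrix."""
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))


def block_step(tour, dist, rng):
    """One iteration of a single emulated block: sample, reduce, accept/reject."""
    current_cost = tour_length(tour, dist)
    candidates = []
    # Each emulated thread samples one candidate move and evaluates it.
    # A random swap stands in for the framework's operators (assumption).
    for _ in range(THREADS_PER_BLOCK):
        i, j = rng.sample(range(len(tour)), 2)
        trial = tour[:]
        trial[i], trial[j] = trial[j], trial[i]
        candidates.append((tour_length(trial, dist), trial))
    # Block-level reduction: keep the cheapest candidate.
    best_cost, best_tour = min(candidates, key=lambda c: c[0])
    # "Thread 0" decides; greedy acceptance is an assumption made here.
    if best_cost < current_cost:
        return best_tour, best_cost
    return tour, current_cost


rng = random.Random(0)
n, pop_size = 30, 4  # pop_size plays the role of P blocks
points = [(rng.random(), rng.random()) for _ in range(n)]
dist = [[math.dist(a, b) for b in points] for a in points]

# One solution per block; each block evolves its solution independently.
population = [rng.sample(range(n), n) for _ in range(pop_size)]
initial_costs = [tour_length(t, dist) for t in population]
final_costs = []
for tour in population:
    cost = tour_length(tour, dist)
    for _ in range(50):
        tour, cost = block_step(tour, dist, rng)
    final_costs.append(cost)
```

On a GPU the outer loop over `population` and the inner loop over threads would both run in parallel; here they are serialized only to show the order of operations.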
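
The weight update described for Figure 3 can be sketched in a few lines. The caption fixes the per-operator statistics (usage and improvement counts), the update period of 10 batches, and the use of an EMA; everything else below is an assumption for illustration, including the class name `BlockAOS`, the smoothing factor `alpha`, the weight floor, and the use of the improvement rate as the EMA target. Only the sequence-level (per-operator) weights are modeled; the K-step level is omitted.

```python
UPDATE_PERIOD = 10  # batches between weight updates, from the Figure 3 caption


class BlockAOS:
    """Per-block operator statistics with a periodic EMA weight update."""

    def __init__(self, n_ops, alpha=0.3, floor=0.05):
        self.alpha = alpha  # EMA smoothing factor (assumed value)
        self.floor = floor  # minimum weight so no operator starves (assumed)
        self.weights = [1.0 / n_ops] * n_ops
        self.usage = [0] * n_ops
        self.improved = [0] * n_ops
        self.batches = 0

    def record(self, op, did_improve):
        """Update the local usage and improvement counts for one operator."""
        self.usage[op] += 1
        self.improved[op] += int(did_improve)

    def end_batch(self):
        """Every UPDATE_PERIOD batches, blend observed success rates into weights."""
        self.batches += 1
        if self.batches % UPDATE_PERIOD != 0:
            return
        for op, used in enumerate(self.usage):
            if used == 0:
                continue  # no evidence for this operator: keep its weight
            rate = self.improved[op] / used
            blended = (1 - self.alpha) * self.weights[op] + self.alpha * rate
            self.weights[op] = max(self.floor, blended)
        total = sum(self.weights)
        self.weights = [w / total for w in self.weights]
        self.usage = [0] * len(self.usage)
        self.improved = [0] * len(self.improved)


aos = BlockAOS(n_ops=2)
for _ in range(UPDATE_PERIOD):
    aos.record(0, True)   # operator 0 always improves in this toy run
    aos.record(1, False)  # operator 1 never improves
    aos.end_batch()
```

In this toy run, the operator that always improves ends with the larger weight, while the floor keeps the unsuccessful operator selectable.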
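
The caching step in Figure 5 amounts to keying compiled artifacts by a SHA-256 digest of the generated source. The following is a minimal sketch under stated assumptions: `TEMPLATE`, `get_binary`, the `.bin` suffix, and the `compile_fn` callback are hypothetical, and the callback stands in for the real nvcc invocation so that the example runs without a CUDA toolchain.

```python
import hashlib
import tempfile
from pathlib import Path

# Hypothetical template; the framework's real template is not shown in the source.
TEMPLATE = (
    "__device__ float objective(const int* sol, int n) {{\n"
    "    {objective}\n"
    "}}\n"
)


def cache_key(source):
    """SHA-256 hex digest of the full generated source."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def get_binary(objective_snippet, cache_dir, compile_fn):
    """Return the cached artifact for this snippet, compiling only on a miss."""
    source = TEMPLATE.format(objective=objective_snippet)
    binary = Path(cache_dir) / (cache_key(source) + ".bin")
    if not binary.exists():
        compile_fn(source, binary)  # stand-in for invoking nvcc
    return binary


compile_calls = []


def fake_compile(source, out_path):
    """Record the call and write a stub file instead of running a compiler."""
    compile_calls.append(source)
    out_path.write_bytes(b"stub")


with tempfile.TemporaryDirectory() as cache_dir:
    first = get_binary("return 0.0f;", cache_dir, fake_compile)
    second = get_binary("return 0.0f;", cache_dir, fake_compile)  # cache hit
    third = get_binary("return 1.0f;", cache_dir, fake_compile)   # new source
```

Hashing the rendered source rather than the raw snippet means a change to the template also invalidates the cache, which is one reasonable design choice among several.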

Theorems & Definitions (1)

  • Definition 1: Problem Profile