GPU-Accelerated Counterfactual Regret Minimization

Juho Kim

TL;DR

This work tackles the computational bottleneck of counterfactual regret minimization (CFR) in large imperfect-information games by re-expressing CFR as a sequence of dense and sparse matrix and vector operations, which makes the algorithm highly parallelizable on a GPU at the cost of higher memory usage. The author implements a GraphBLAS-inspired framework that encodes the game tree with adjacency, level-graph, and masking matrices, and performs the CFR update steps through linear-algebra recurrences rather than recursive tree traversal. Across 20 OpenSpiel games, the GPU-accelerated CFR is up to 401.2× faster than OpenSpiel's Python baseline and up to 203.6× faster than its C++ baseline, with the largest gains on large games. The paper also reports on memory trade-offs and precision effects. The approach shows practical potential for solving larger games faster and provides a foundation for future enhancements, including integration of CFR variants and pruning strategies within the matrix-based framework.

Abstract

Counterfactual regret minimization is a family of no-regret learning algorithms capable of solving large-scale imperfect-information games. We propose implementing this algorithm as a series of dense and sparse matrix and vector operations, thereby making it highly parallelizable for a graphics processing unit, at the cost of higher memory usage. Our experiments show that our implementation performs up to about 401.2 times faster than OpenSpiel's Python implementation and, on an expanded set of games, up to about 203.6 times faster than OpenSpiel's C++ implementation. The speedup becomes more pronounced as the size of the game being solved grows.
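
To make the matrix formulation summarized above more concrete, the following is a minimal sketch of how a recursive tree traversal can be replaced by level-by-level matrix-vector products. It is not the paper's implementation: the 5-node tree, the names `M`, `depth`, `terminal_utility`, `expected_values`, and `reach_probabilities` are all hypothetical, and dense NumPy arrays on the CPU stand in for the sparse GPU matrices the paper describes.

```python
import numpy as np

# Hypothetical 5-node tree: 0 -> {1, 2}, 1 -> {3, 4}; nodes 2, 3, 4 are terminal.
depth = np.array([0, 1, 1, 2, 2])  # depth of each node (a simple "level" encoding)
n = len(depth)

# Strategy-weighted adjacency matrix: M[p, c] is the probability of moving
# from parent p to child c under some fixed (assumed) strategy.
M = np.zeros((n, n))
M[0, 1], M[0, 2] = 0.5, 0.5
M[1, 3], M[1, 4] = 0.25, 0.75

# Utilities at terminal nodes (zero at non-terminal nodes).
terminal_utility = np.array([0.0, 0.0, 1.0, 2.0, -4.0])


def expected_values(M, depth, terminal_utility):
    """Bottom-up pass: one masked matrix-vector product per tree level."""
    values = terminal_utility.copy()
    for d in range(depth.max() - 1, -1, -1):
        mask = depth == d  # selects the nodes at depth d
        values = values + mask * (M @ values)
    return values


def reach_probabilities(M, depth):
    """Top-down pass: push reach probability from each level to the next."""
    reach = (depth == 0).astype(float)  # the root is reached with probability 1
    for d in range(1, depth.max() + 1):
        mask = depth == d
        reach = reach + mask * (M.T @ reach)
    return reach


values = expected_values(M, depth, terminal_utility)
reach = reach_probabilities(M, depth)
print(values)
print(reach)
```

Each loop iteration touches every node of one level at once, which is the property that lets such a formulation map onto GPU kernels. The trade-off noted in the abstract also shows up here: the whole tree must be held in memory as matrices instead of being generated on the fly.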

Paper Structure (25 sections, 41 equations, 3 figures, 9 tables)

Figures (3)

  • Figure 1: A log-log graph showing the average CFR iteration runtime with respect to the game size for both experiments. The four lines show the runtimes of the four benchmarked implementations. Note that the iteration time of each game does not depend solely on the number of nodes; the number of players and the number of infosets also play a sizable role. In addition, for the OpenSpiel implementations, the efficiency of the game logic also matters because, on each iteration, they traverse the game tree by generating new states online.
  • Figure 2: A log-log graph showing the total allocated CUDA memory by our GPU implementation for each game tested in Experiment 1.
  • Figure 3: Log-log graphs of exploitabilities for each game tested using our GPU implementation for the first 16,384 iterations. Note that some of these games are not two-player zero-sum, so the concept of exploitability is not well-defined for them. Exploitability is analyzed only for the games tested in Experiment 1.

Theorems & Definitions (1)

  • Definition 1