
CAT: Cellular Automata on Tensor cores

Cristóbal A. Navarro, Felipe A. Quezada, Enzo Meneses, Héctor Ferrada, Nancy Hitschfeld

TL;DR

This work presents CAT, a GPU tensor core approach that can accelerate CA in which the cell transition function acts on a weighted summation of its neighborhood.

Abstract

Cellular automata (CA) are simulation models that can produce complex emergent behaviors from simple local rules. Although state-of-the-art GPU solutions are already fast due to their data-parallel nature, their performance can rapidly degrade in CA with a large neighborhood radius. With the inclusion of tensor cores across the entire GPU ecosystem, interest has grown in finding ways to leverage these fast units outside the field of artificial intelligence, which was their original purpose. In this work, we present CAT, a GPU tensor core approach that can accelerate CA in which the cell transition function acts on a weighted summation of its neighborhood. CAT is evaluated theoretically, using an extended PRAM cost model, as well as empirically, using the Larger Than Life (LTL) family of CA as case studies. The results confirm that the cost model is accurate, showing that CAT exhibits constant time throughout the entire radius range $1 \le r \le 16$, and its theoretical speedups agree with the empirical results. At low radius ($r=1,2$), CAT is competitive and is surpassed only by the fastest state-of-the-art GPU solution. Starting from $r=3$, CAT progressively outperforms all other approaches, reaching speedups of up to $101\times$ over a GPU baseline and up to $\sim 14\times$ over the fastest state-of-the-art GPU approach. In terms of energy efficiency, CAT is competitive in the range $1 \le r \le 4$, and for $r \ge 5$ it is the most energy-efficient approach. As for performance scaling across GPU architectures, CAT shows a promising trend that, if it continues in future generations, would increase its performance at a higher rate than that of classical GPU solutions. The results obtained in this work position CAT as an attractive GPU approach for scientists who need to study emergent phenomena in CA with large neighborhood radius.
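The class of CA that CAT targets is the one whose transition function acts on a weighted summation of each cell's neighborhood. As a minimal illustration (plain Python, not the paper's CUDA/tensor core implementation; the `step` function and grid layout are introduced here for exposition), here is one step of Conway's Game of Life, the $r=1$ member of the Larger Than Life family, written so that the rule is applied to a neighborhood summation with unit weights:

```python
# Sketch: a CA step expressed as a weighted neighborhood summation,
# the class of transition functions CAT accelerates. Rule shown:
# Conway's Game of Life, i.e. Larger Than Life with radius r = 1.

def step(grid, r=1):
    """One CA step with periodic boundary conditions.

    Each cell's next state depends only on the weighted sum of its
    Moore neighborhood of radius r; all weights are 1 here.
    """
    n = len(grid)
    nxt = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Summation over the (2r+1)^2 window (weights = 1),
            # then subtract the cell itself to get living neighbors.
            s = sum(grid[(i + di) % n][(j + dj) % n]
                    for di in range(-r, r + 1)
                    for dj in range(-r, r + 1)) - grid[i][j]
            # The transition rule is applied to the summation only.
            nxt[i][j] = 1 if s == 3 or (grid[i][j] == 1 and s == 2) else 0
    return nxt

# A "blinker" oscillates with period 2, so two steps restore it.
blinker = [[0] * 5 for _ in range(5)]
blinker[2][1] = blinker[2][2] = blinker[2][3] = 1
assert step(step(blinker)) == blinker
```

Any LTL rule fits this shape: only the radius and the thresholds applied to the summation change, which is what lets a single summation kernel (here a double loop, in CAT a sequence of MMAs) serve the whole family.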


Paper Structure

This paper contains 21 sections, 17 equations, 13 figures, 5 tables, and 1 algorithm.

Figures (13)

  • Figure 1: Traditional data-parallel approach for simulating Cellular Automata (CA) using a global halo of ghost cells, a common way to avoid complex logic in the boundary threads. In this example each thread $t_{i,j}$ is in charge of one cell and must explore its Moore neighborhood of radius $r=1$. In general, with this approach one simulation step costs at least $\Omega(r^2)$ time.
  • Figure 2: Processing group in Nvidia's Hopper architecture. Four of these make up a streaming multiprocessor (SM), and dozens of SMs form an entire GPU chip. Currently, the number of tensor cores in a GPU chip can reach into the hundreds. Image inspired by the CUDA C Programming guide cudaCProgGuide2024.
  • Figure 3: Concept of how a pair of matrix products between an entire CA ($\Lambda$) and a band matrix ($\Pi$) can count the living neighbors of all cells (no tensor core logic introduced yet). Here, the CA includes a global halo of ghost cells, giving a total size of $(n + 2r)\times (n+2r) = 18 \times 18$ as the neighborhood radius is $r=1$. The final cells of $R$ contain their number of living neighbors plus their own state added twice (cells with no number have value zero).
  • Figure 4: On the left, an explicit representation of the band matrix $\Pi$, which uses $\mathcal{O}(n^2)$ memory. On the right, the CAT representation of $\Pi$, which uses just three fragments $\pi_1, \pi_2, \pi_3$ to represent the entire matrix. Fragments are of size $4\times 4$ just for visual simplicity.
  • Figure 5: Overview of CAT illustrated with a Game of Life of $n\times n = 16 \times 16$ cells, neighborhood radius $r=1$, and periodic boundary conditions using a global halo of ghost fragments (purple). In the first step, all fragments $F^H_{i,j}$ inside the dashed region of $H$ contain the horizontal reduction computed with three sequential MMAs between fragments $F^\Lambda_{i,j-1},F^\Lambda_{i,j},F^\Lambda_{i,j+1}$ and $\pi_1, \pi_2, \pi_3$. In the second step, all $F^R_{i,j}$ inside the dashed region of $R$ contain the full reduction computed with three more MMAs between the fragments $\pi_3,\pi_2,\pi_1$ and $F^H_{i-1,j}, F^H_{i,j}, F^H_{i+1,j}$. This gives a total cost of six MMAs per fragment at any radius that fits in the fragment. In this example the fragments are shown as $4\times 4$ for visual clarity, but in practice the ones employed in CAT are of size $16\times 16$.
  • ...and 8 more figures
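The core idea behind Figures 3-5, reducing every cell's Moore neighborhood with two matrix products against a band matrix, can be sketched in a few lines of NumPy. This is a simplification for illustration only: it omits the ghost-cell halo and the $16\times 16$ fragment decomposition onto tensor cores that CAT actually uses, and the `band_matrix` helper is a name introduced here, not from the paper.

```python
import numpy as np

# Sketch of the band-matrix neighbor reduction (Figure 3), simplified:
# multiplying the CA state matrix by a band matrix on each side sums the
# (2r+1) x (2r+1) window of every cell with just two matrix products.

def band_matrix(n, r):
    """n x n band matrix Pi with ones where |i - j| <= r, zeros elsewhere."""
    idx = np.arange(n)
    return (np.abs(idx[:, None] - idx[None, :]) <= r).astype(np.int32)

n, r = 8, 1
rng = np.random.default_rng(0)
lam = rng.integers(0, 2, size=(n, n), dtype=np.int32)  # CA states (Lambda)

pi = band_matrix(n, r)
# pi @ lam reduces along columns (vertical pass); the second product by pi
# reduces along rows (horizontal pass), so R[i, j] holds the sum of lam
# over the window centred at (i, j): the living neighbors of cell (i, j)
# plus its own state.
R = pi @ lam @ pi

# Check an interior cell against a direct window summation (interior only,
# since this sketch has no halo handling at the borders).
i, j = 3, 4
assert R[i, j] == lam[i - r:i + r + 1, j - r:j + r + 1].sum()
```

Each product touches only a band of width $2r+1$, which is why CAT can represent $\Pi$ with three small fragments (Figure 4) and perform the whole reduction as a fixed number of MMAs per fragment, independent of $r$ within the fragment size.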