Exploring the Versal AI Engine for 3D Gaussian Splatting
Kotaro Shimamura, Ayumi Ohno, Shinya Takamaeda-Yamazaki
TL;DR
The paper evaluates the Versal AI Engine on 3D Gaussian Splatting by developing a dedicated algorithm that vectorizes computation within each tile and exploits spatial parallelism across AI Engine tiles. The study combines simulator-based AI Engine timing with hardware measurements on the VCK190, demonstrating up to 226× higher throughput than a naive baseline while identifying data transfer through the programmable logic (PL) as a key bottleneck. Key contributions include detailed hardware optimizations (vectorization, task partitioning) and a mapping strategy that leverages the 8×50 AI Engine grid to maximize throughput. The findings provide practical guidance for deploying tile-based architectures in real-time 3D reconstruction and similar high-parallelism workloads, and the insights may carry over to AI inference and image processing tasks.
Abstract
Dataflow-oriented spatial architectures are an emerging paradigm for higher computational performance and efficiency. The AMD Versal AI Engine is a commercial spatial architecture consisting of tiles of VLIW processors that support SIMD operations, arranged in a two-dimensional mesh. To maximize performance, the architecture requires task assignments and dataflow configurations to be designed explicitly for each tile, which demands advanced techniques and meticulous design. However, few works have revealed the performance characteristics of the Versal AI Engine through practical workloads. In this work, we provide a comprehensive performance evaluation of the Versal AI Engine using Gaussian feature computation in 3D Gaussian splatting as a practical workload, and we then propose a novel dedicated algorithm to fully exploit the hardware architecture. The computations of 3D Gaussian splatting include matrix multiplications and color computations that use high-dimensional spherical harmonic coefficients. These tasks are processed efficiently by leveraging the SIMD capabilities of the AI Engine and its instruction-level parallelism. Additionally, pipelined processing is achieved by assigning different tasks to individual cores, thereby fully exploiting the spatial parallelism of the AI Engines. In simulation-based evaluation, the proposed method achieved a 226-fold throughput increase over a naive approach. These findings provide valuable insights for application development that effectively harnesses the spatial and architectural advantages of AI Engines.
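To make the workload concrete, the following is a minimal scalar reference sketch of the two kinds of per-Gaussian computation the abstract names: a covariance built by matrix multiplication and a color evaluated from spherical harmonic (SH) coefficients. It follows the standard 3D Gaussian splatting formulation, not the paper's AI Engine kernels, which are not shown here. The function names (`covariance3d`, `sh_color`) are hypothetical, and the SH evaluation is truncated to degree 1 for brevity, whereas the paper refers to higher-dimensional coefficients.

```python
import math

# Real SH basis constants for degree 0 and degree 1,
# as used in the standard 3D Gaussian splatting formulation.
SH_C0 = 0.28209479177387814
SH_C1 = 0.4886025119029199


def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose3(m):
    """Transpose a 3x3 matrix."""
    return [[m[j][i] for j in range(3)] for i in range(3)]


def quat_to_rot(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def covariance3d(scale, quat):
    """Sigma = (R S)(R S)^T, with S = diag(scale) and R from the quaternion."""
    r = quat_to_rot(quat)
    s = [[scale[i] if i == j else 0.0 for j in range(3)] for i in range(3)]
    m = matmul3(r, s)
    return matmul3(m, transpose3(m))


def sh_color(coeffs, direction):
    """Evaluate RGB from 4 SH coefficient triples (degrees 0 and 1).

    coeffs: list of 4 [r, g, b] triples; direction: unit view vector (x, y, z).
    """
    x, y, z = direction
    basis = [SH_C0, -SH_C1 * y, SH_C1 * z, -SH_C1 * x]
    rgb = [sum(b * c[ch] for b, c in zip(basis, coeffs)) + 0.5
           for ch in range(3)]
    return [max(0.0, v) for v in rgb]  # clamp negative values to zero
```

Both functions reduce to short chains of multiply-accumulate operations over small fixed-size arrays, which is the kind of structure that SIMD lanes and per-tile pipelining can exploit.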
