WaSP: Warp Scheduling to Mimic Prefetching in Graphics Workloads
Diya Joseph, Juan Luis Aragón, Joan-Manuel Parcerisa, Antonio Gonzalez
TL;DR
WaSP targets long memory latencies in Tile-Based Rendering (TBR) GPUs by scheduling a small subset of priority warps early, which exploits otherwise underutilized memory parallelism. It selects a compact Mesh4-based subset of warps to represent the tile's texture footprint and uses a blocking-prediction mechanism to avoid saturating the MSHRs, effectively emulating prefetching for the regular warps. TEAPOT simulations with Android benchmarks show an average IPC gain of 3.9%, a 9% reduction in memory latency, and only 0.5% additional energy consumption, with negligible hardware overhead. The approach is a practical way to improve latency hiding in graphics workloads, and it complements existing texture-locality and caching strategies.
Abstract
Contemporary GPUs are designed to hide long-latency operations effectively; however, factors such as core occupancy (the number of warps in a core) and pipeline width can limit how well they do so. This is particularly evident in Tile-Based Rendering (TBR) GPUs, where core occupancy remains low for extended periods. To address this challenge, we introduce WaSP, a lightweight warp scheduler tailored to GPUs running graphics applications. WaSP mimics prefetching by starting a select subset of warps, termed priority warps, early in execution to reduce memory latency for subsequent warps. This optimization taps into memory parallelism that is inherent in the GPU core but underutilized, because the baseline scheduler spaces misses evenly throughout execution in order to exploit the spatial locality of graphics workloads. WaSP improves on this by reducing average memory latency while maintaining locality for the majority of warps. While it maximizes the use of memory parallelism, WaSP avoids flooding the caches with misses so that the MSHRs (Miss Status Holding Registers) do not fill up, which reduces the cache stalls that block further accesses to the cache. Overall, WaSP yields a 3.9% performance speedup. WaSP achieves these improvements with negligible overhead, making it a promising way to improve how efficiently GPUs manage memory latency.
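As a rough illustration of the scheduling idea described above, the following Python sketch issues priority warps ahead of regular warps as long as the MSHRs keep some headroom. It is a toy model under stated assumptions, not the WaSP hardware design: the function name `pick_next_warp`, the `reserve` parameter, and the simple occupancy threshold (used here as a stand-in for the paper's blocking-prediction mechanism) are all hypothetical.

```python
from collections import deque

def pick_next_warp(priority, regular, mshr_in_use, mshr_capacity, reserve=1):
    """Return the id of the next warp to issue, or None if nothing can issue.

    Hypothetical policy: a priority warp is issued early only if taking one
    more miss would still leave `reserve` MSHR entries free. Otherwise the
    scheduler falls back to the regular warps, issued in their usual order.
    """
    # Assumption: one new miss per issued priority warp.
    has_headroom = mshr_in_use + reserve < mshr_capacity
    if priority and has_headroom:
        return priority.popleft()
    if regular:
        return regular.popleft()
    # Priority warps may remain, but issuing them now could saturate the MSHRs.
    return None
```

For example, with `priority = deque([0, 4])` and `regular = deque([1, 2, 3])`, a call with `mshr_in_use=0, mshr_capacity=4` returns priority warp 0, while a call with `mshr_in_use=3` falls back to regular warp 1 because the MSHRs are nearly full.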
