WaSP: Warp Scheduling to Mimic Prefetching in Graphics Workloads

Diya Joseph; Juan Luis Aragón; Joan-Manuel Parcerisa; Antonio Gonzalez

TL;DR

WaSP targets long memory latencies in Tile-Based Rendering (TBR) GPUs by using a small subset of priority warps to exploit otherwise underutilized memory parallelism. It selects a compact Mesh4-based subset of warps that represents the tile's texture footprint and uses a blocking-prediction mechanism to avoid saturating the MSHRs (Miss Status Holding Registers), so the priority warps effectively act as a prefetch for the regular warps. Simulations on TEAPOT with Android benchmarks show an average IPC gain of $3.9\%$, a $9\%$ reduction in memory latency, and only $0.5\%$ additional energy consumption, with negligible hardware overhead. The approach complements existing texture-locality and caching strategies for latency hiding in graphics workloads.

Abstract

Contemporary GPUs are designed to handle long-latency operations effectively; however, challenges such as core occupancy (number of warps in a core) and pipeline width can impede their latency management. This is particularly evident in Tile-Based Rendering (TBR) GPUs, where core occupancy remains low for extended durations. To address this challenge, we introduce WaSP, a lightweight warp scheduler tailored for GPUs in graphics applications. WaSP strategically mimics prefetching by initiating a select subset of warps, termed priority warps, early in execution to reduce memory latency for subsequent warps. This optimization taps into the inherent but underutilized memory parallelism within the GPU core. This underutilization is a consequence of a baseline scheduler that evenly spaces misses throughout execution to exploit the inherent spatial locality in graphics workloads. WaSP improves on this by reducing average memory latency while maintaining locality for the majority of warps. While maximizing memory parallelism utilization, WaSP prevents saturating the caches with misses to avoid filling up the MSHRs (Miss Status Holding Registers). This approach reduces cache stalls that halt further accesses to the cache. Overall, WaSP yields a significant 3.9% performance speedup. Importantly, WaSP accomplishes these enhancements with a negligible overhead, positioning it as a promising solution for enhancing the efficiency of GPUs in managing latency challenges.
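
The scheduling idea described in the abstract can be illustrated with a toy model. The Python sketch below is not the paper's implementation. The names Warp, would_block and select_next are hypothetical, each warp is assumed to touch a single cache line, and a new miss is assumed to block exactly when every MSHR entry is in use. A request to a line that is already cached or already in flight is assumed never to block.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Set


@dataclass(frozen=True)
class Warp:
    wid: int        # warp id; lower id means older warp (assumption)
    line: int       # the single cache line this toy warp touches
    priority: bool  # True if the warp belongs to the priority subset


def would_block(warp: Warp, in_flight: Set[int], cached: Set[int],
                mshr_entries: int) -> bool:
    """Toy blocking prediction: would issuing this warp stall the cache?"""
    if warp.line in cached or warp.line in in_flight:
        # Hit, or a miss that merges into an existing MSHR entry.
        return False
    # A new miss needs a free MSHR entry.
    return len(in_flight) >= mshr_entries


def select_next(ready: Sequence[Warp], in_flight: Set[int], cached: Set[int],
                mshr_entries: int) -> Optional[Warp]:
    """Pick the next warp to issue from `ready` (ordered oldest first)."""
    # 1) Prefer the oldest priority warp that is predicted not to block.
    for w in ready:
        if w.priority and not would_block(w, in_flight, cached, mshr_entries):
            return w
    # 2) Otherwise fall back to the oldest regular warp (baseline order).
    for w in ready:
        if not w.priority:
            return w
    # 3) Only priority warps predicted to block remain (or nothing is ready).
    return ready[0] if ready else None


if __name__ == "__main__":
    ready = [Warp(0, 10, False), Warp(1, 11, True), Warp(2, 12, True)]
    # MSHRs empty: the oldest priority warp is issued early.
    print(select_next(ready, set(), set(), mshr_entries=2).wid)
    # MSHRs full with unrelated lines: priority warps are held back.
    print(select_next(ready, {20, 21}, set(), mshr_entries=2).wid)
```

In this sketch, priority warps run ahead only while their misses can still be tracked. Once the MSHRs are full, the scheduler returns to the regular warps instead of stalling the cache, which is the behaviour the abstract attributes to WaSP at a high level.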

Paper Structure (30 sections, 4 equations, 15 figures, 2 tables)

Introduction
Background
Graphics Pipeline
Memory Organization
Texture Locality
Warp Scheduling
Cache Stalls
WaSP
The Example
Priority Warps Selection
Subset Size Ratio
Quad Selection
Priority Warp Scheduling
Blocking Prediction
Hardware Overhead
...and 15 more sections

Figures (15)

Figure 1: Speedup w.r.t. an ideal main memory with zero latency.
Figure 2: The Graphics Pipeline of a TBR GPU.
Figure 3: Baseline memory hierarchy and memory organization.
Figure 4: The Example.
Figure 5: Estimation of a reasonable priority subset size.
...and 10 more figures
