Exposing Shadow Branches
Chrysanthos Pepi, Bhargav Reddy Godala, Krishnam Tibrewala, Gino Chacon, Paul V. Gratz, Daniel A. Jiménez, Gilles A. Pokam, David I. August
TL;DR
Skia targets front-end stalls caused by BTB misses in decoupled front-ends that use Fetch Directed Instruction Prefetching (FDIP). It exploits shadow branches: branch instructions that sit in the unused bytes of instruction cache lines FDIP has already fetched. It introduces the Shadow Branch Decoder (SBD) and the Shadow Branch Buffer (SBB), a compact two-buffer scheme that stores direct unconditional branches and returns and serves them in parallel with BTB lookups. Index computation and path validation identify head and tail shadow branches. Skia achieves a geomean IPC improvement of about 5.64% over a baseline 8K-entry BTB, and around 2% over spending the same state on a larger BTB; gains are largest when the shadow branches differ from the branches the BTB already holds. With a 12.25KB footprint, Skia demonstrates robust performance across 16 front-end-bound workloads and remains effective as the BTB size scales, offering a practical way to reduce front-end stalls in data-center CPUs without heavy L1-I traffic or BTB pollution.
Abstract
Modern processors implement a decoupled front-end in the form of Fetch Directed Instruction Prefetching (FDIP) to avoid front-end stalls. FDIP is driven by the Branch Prediction Unit (BPU), relying on the BPU's accuracy and branch target tracking structures to speculatively fetch instructions into the Instruction Cache (L1I). As data center applications become more complex, their code footprints also grow, resulting in an increase in Branch Target Buffer (BTB) misses. FDIP can alleviate L1I cache misses, but when it encounters a BTB miss, the BPU may not identify the current instruction as a branch to FDIP. This can prevent FDIP from prefetching or cause it to speculate down the wrong path, further polluting the L1I cache. We observe that the vast majority, 75%, of BTB-missing, unidentified branches are actually present in instruction cache lines that FDIP has previously fetched, but these missing branches have not yet been decoded and inserted into the BTB. This is because an instruction cache line is decoded only from an entry point (the target of the previous taken branch) to an exit point (the taken branch). We call branch instructions present in the ignored portion of the cache line "Shadow Branches". Here we present Skia, a novel shadow branch decoding technique that identifies and decodes unused bytes in cache lines fetched by FDIP, inserting the branches found there into a Shadow Branch Buffer (SBB). The SBB is accessed in parallel with the BTB, allowing FDIP to speculate despite a BTB miss. With a minimal storage state of 12.25KB, Skia delivers a geomean speedup of ~5.7% over an 8K-entry BTB (78KB) and ~2% versus adding an equal amount of state to the BTB across 16 front-end bound applications. Since many branches stored in the SBB are unique compared to those in a similarly sized BTB, we consistently observe greater performance gains with Skia across all examined sizes until saturation.
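To make the entry-point and exit-point description concrete, the following is a minimal Python sketch, not the hardware design. It assumes a 64-byte cache line and treats the BTB and SBB as plain dictionaries from branch address to target. The names shadow_regions and fdip_lookup, the example addresses, and the rule that a BTB hit takes priority over an SBB hit are all assumptions made for illustration. The sketch also ignores instruction-boundary discovery inside the shadow bytes.

```python
from typing import Dict, Optional, Tuple

LINE_SIZE = 64  # bytes per instruction cache line (assumed for this sketch)


def shadow_regions(entry: int, exit_end: int,
                   line_size: int = LINE_SIZE) -> Tuple[range, range]:
    """Return the (head, tail) byte-offset ranges that normal decode skips.

    entry    -- offset of the first decoded byte (target of the previous
                taken branch)
    exit_end -- offset one past the last byte of the taken branch that
                leaves the line
    """
    if not 0 <= entry <= exit_end <= line_size:
        raise ValueError("expected 0 <= entry <= exit_end <= line_size")
    head = range(0, entry)             # bytes before the entry point
    tail = range(exit_end, line_size)  # bytes after the exit point
    return head, tail


def fdip_lookup(pc: int, btb: Dict[int, int],
                sbb: Dict[int, int]) -> Tuple[Optional[int], str]:
    """Model probing the BTB and SBB for the same address.

    Hardware would probe both structures in parallel; here the BTB result
    is simply preferred when both hit (an assumption of this sketch).
    """
    if pc in btb:
        return btb[pc], "btb"
    if pc in sbb:
        return sbb[pc], "sbb"
    return None, "miss"


# Hypothetical line at 0x4000, entered at offset 24 and left at offset 40.
line_base = 0x4000
head, tail = shadow_regions(entry=24, exit_end=40)

# Hypothetical branches in the line: byte offset -> branch target.
# Offset 30 lies in the decoded region, so it is not a shadow branch.
branches_in_line = {8: 0x5000, 30: 0x7000, 48: 0x6000}
sbb = {line_base + off: tgt
       for off, tgt in branches_in_line.items()
       if off in head or off in tail}
```

With an empty BTB, fdip_lookup(0x4008, {}, sbb) finds the head shadow branch in the SBB, so the prefetcher can still follow its target instead of stalling or running down the wrong path.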
