Stale Profile Matching
Amir Ayupov, Maksim Panchenko, Sergey Pupyrev
TL;DR
This work tackles profile staleness in profile-guided optimization (PGO) by introducing a practical two-stage approach, matching followed by inference, that reuses profiles collected on binaries built several revisions behind the release. Implemented in the BOLT post-link optimizer, the method uses a hierarchical, multi-level block hashing scheme to match basic blocks and a minimum-cost flow framework to infer consistent block and branch counts, preserving as much of the original guidance as possible. Across large open-source binaries and production Meta workloads, it recovers a substantial fraction ($0.6$ to $0.8$) of the maximum potential BOLT speedup when profiles are stale, including a clang example that achieves $0.78$ of the PGO benefit with over $90\%$ of the profile data stale. The approach reduces the need for re-profiling after hotfixes, broadens the practical adoption of PGO, and is general enough to adapt to PGO systems beyond BOLT.
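To make the matching stage concrete, the following is a minimal Python sketch of multi-level block hashing. It is an illustration under stated assumptions, not BOLT's implementation: the two levels used here (a "strict" hash over full instruction text and a "loose" hash over opcodes only), the helper names `block_hashes` and `match_blocks`, and the representation of a basic block as a list of instruction strings are all hypothetical choices made for this example.

```python
import hashlib
from typing import Dict, List


def _digest(parts: List[str]) -> str:
    """Stable short hash of a sequence of strings."""
    return hashlib.sha1("\n".join(parts).encode()).hexdigest()[:16]


def block_hashes(instructions: List[str]) -> Dict[str, str]:
    """Hash one basic block at two (assumed) precision levels.

    strict: full instruction text, operands included.
    loose:  opcodes only, so operand changes do not break the match.
    Instructions are assumed to be non-empty strings.
    """
    opcodes = [ins.split()[0] for ins in instructions]
    return {"strict": _digest(instructions), "loose": _digest(opcodes)}


def match_blocks(
    stale: Dict[str, List[str]], fresh: Dict[str, List[str]]
) -> Dict[str, str]:
    """Map stale block names to fresh block names.

    Tries the most precise hash first, then falls back to the looser
    one for blocks that are still unmatched. Each fresh block is used
    at most once.
    """
    stale_h = {name: block_hashes(ins) for name, ins in stale.items()}
    fresh_h = {name: block_hashes(ins) for name, ins in fresh.items()}
    matched: Dict[str, str] = {}
    used = set()
    for level in ("strict", "loose"):
        # Index the fresh blocks that are still available at this level.
        index: Dict[str, List[str]] = {}
        for name, hashes in fresh_h.items():
            if name not in used:
                index.setdefault(hashes[level], []).append(name)
        for name, hashes in stale_h.items():
            if name in matched:
                continue
            candidates = index.get(hashes[level], [])
            if candidates:
                target = candidates.pop(0)
                matched[name] = target
                used.add(target)
    return matched
```

In this sketch, a block whose operands changed between revisions fails the strict comparison but can still be recovered at the loose level, while a block with no counterpart stays unmatched and would be left to the inference stage.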
Abstract
Profile-guided optimizations rely on profile data to direct compilers in generating optimized code. To achieve the maximum performance boost, profile data needs to be collected on the same version of the binary that is being optimized. In practice, however, there is typically a gap between profile collection and the release, which makes a portion of the profile invalid for optimization. This phenomenon is known as profile staleness, and it is a serious practical problem for data-center workloads, for both compilers and binary optimizers. In this paper, we thoroughly study the staleness problem and propose the first practical solution for utilizing profiles collected on binaries built several revisions behind the release. Our algorithm is developed and implemented in a mainstream open-source post-link optimizer, BOLT. An extensive evaluation on a variety of standalone benchmarks and production services indicates that the new method recovers up to $0.8$ of the maximum BOLT benefit, even when most of the input profile data is stale and would otherwise have been discarded by the optimizer.
