Ridgeline: A 2D Roofline Model for Distributed Systems
Fabio Checconi, Jesmin Jahan Tithi, Fabrizio Petrini
TL;DR
The traditional Roofline model bounds performance by compute throughput and memory bandwidth alone, so it cannot expose interconnect bandwidth as a bottleneck in distributed systems. This paper introduces the Ridgeline, a 2D plane that combines compute, memory, and network constraints through three intensities, $I_A=F/B_M$, $I_M=B_M/B_N$, and $I_N=F/B_N$ (so that $I_N = I_A \cdot I_M$), and adds a balance line of the form $xy=k$ that separates the region where compute dominates from the region where the network dominates. The key contributions are a planar visualization that reveals the bottleneck resource, a principled method for combining multiple resource limits, and a practical framework for estimating runtime from the dominant bound. The approach is demonstrated on a data-parallel MLP workload (e.g., Facebook's DLRM), where the intensity shifts caused by changing the batch size align with the predicted bottlenecks, making the model a tool for guiding optimization in distributed ML and HPC workloads.
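The three intensities and the dominant-bound idea can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that $F$ is the number of floating-point operations, $B_M$ the bytes moved to or from memory, and $B_N$ the bytes moved over the network, and it takes the runtime estimate to be the largest of the three per-resource times. The class names, the machine peaks, and the two workloads are hypothetical values chosen for illustration, not figures from the paper.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    flops: float      # F: floating-point operations (assumed meaning)
    mem_bytes: float  # B_M: bytes moved to/from memory (assumed meaning)
    net_bytes: float  # B_N: bytes moved over the network (assumed meaning)


@dataclass
class Machine:
    peak_flops: float  # peak compute throughput, FLOP/s (hypothetical)
    mem_bw: float      # peak memory bandwidth, bytes/s (hypothetical)
    net_bw: float      # peak network bandwidth, bytes/s (hypothetical)


def intensities(w):
    """Return (I_A, I_M, I_N) as defined in the summary above."""
    i_a = w.flops / w.mem_bytes      # I_A = F / B_M
    i_m = w.mem_bytes / w.net_bytes  # I_M = B_M / B_N
    i_n = w.flops / w.net_bytes      # I_N = F / B_N = I_A * I_M
    return i_a, i_m, i_n


def bottleneck(w, m):
    """Return the resource with the largest time and that time in seconds."""
    times = {
        "compute": w.flops / m.peak_flops,
        "memory": w.mem_bytes / m.mem_bw,
        "network": w.net_bytes / m.net_bw,
    }
    name = max(times, key=times.get)
    return name, times[name]


# Illustrative numbers only: same traffic, ten times more work per byte.
machine = Machine(peak_flops=1e12, mem_bw=1e11, net_bw=1e10)
small_batch = Workload(flops=4e9, mem_bytes=1e8, net_bytes=1e8)
large_batch = Workload(flops=4e10, mem_bytes=1e8, net_bytes=1e8)

for label, w in (("small batch", small_batch), ("large batch", large_batch)):
    print(label, intensities(w), bottleneck(w, machine))
```

With these made-up numbers the first workload is network-bound and the second is compute-bound. The switch happens when $I_N$ crosses the ratio of peak compute to peak network bandwidth (here 100 FLOP/byte). If the plane's axes are taken to be $x = I_M$ and $y = I_A$, which is an assumption of this sketch, that threshold is a curve of constant product $xy = k$.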
Abstract
In this short paper, we introduce the Ridgeline model, an extension of the Roofline model [4] for distributed systems. The Roofline model targets shared memory systems, bounding the performance of a kernel based on its operational intensity, and the peak compute throughput and memory bandwidth of the execution system. In a distributed setting, with multiple communicating compute entities, the network must be taken into account to model the system behavior accurately. The Ridgeline aggregates information on compute, memory, and network limits in one 2D plot to show, in an intuitive way, which of the resources is the expected bottleneck. We show the applicability of the Ridgeline in a case study based on a data-parallel Multi-Layer Perceptron (MLP) instance.
