Ridgeline: A 2D Roofline Model for Distributed Systems

Fabio Checconi; Jesmin Jahan Tithi; Fabrizio Petrini

TL;DR

The traditional Roofline model does not account for interconnect bandwidth, which can become the bottleneck in distributed systems. The paper introduces the Ridgeline, a 2D plot that combines compute, memory, and network constraints through three intensities, $I_A=F/B_M$, $I_M=B_M/B_N$, and $I_N=F/B_N$ (so that $I_N = I_A \cdot I_M$), and adds a balance line of the form $xy=k$ that separates the region where compute dominates from the region where the network dominates. The key contributions are a planar visualization that shows which resource is the bottleneck, a principled method for combining multiple resource limits, and a practical way to estimate runtime from the dominant bound. The approach is demonstrated on a data-parallel MLP workload (e.g., Facebook's DLRM): as the batch size changes, the intensities shift, and the observed bottleneck matches the model's prediction. The Ridgeline is intended as a tool to guide optimization of distributed ML and HPC workloads.
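The intensity definitions and the "dominant bound" idea can be illustrated with a short sketch. It assumes that $F$ is the floating-point work and that $B_M$ and $B_N$ are the bytes moved through memory and over the network (the summary does not define these symbols). It also assumes that runtime is bounded below by the slowest of the three resources. The `Machine` and `Kernel` types and all numbers are hypothetical and are not taken from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Machine:
    """Hypothetical per-node peak rates (illustrative values only)."""
    peak_flops: float  # FLOP/s
    mem_bw: float      # bytes/s to local memory
    net_bw: float      # bytes/s over the interconnect


@dataclass
class Kernel:
    """Assumed reading of the summary's symbols F, B_M, B_N."""
    flops: float      # F: floating-point operations performed
    mem_bytes: float  # B_M: bytes moved to/from memory
    net_bytes: float  # B_N: bytes moved over the network


def intensities(k: Kernel) -> tuple[float, float, float]:
    """Return (I_A, I_M, I_N) = (F/B_M, B_M/B_N, F/B_N)."""
    i_a = k.flops / k.mem_bytes
    i_m = k.mem_bytes / k.net_bytes
    i_n = k.flops / k.net_bytes
    return i_a, i_m, i_n


def dominant_bound(m: Machine, k: Kernel) -> tuple[str, float]:
    """Return the resource giving the largest runtime lower bound, and that bound in seconds."""
    times = {
        "compute": k.flops / m.peak_flops,
        "memory": k.mem_bytes / m.mem_bw,
        "network": k.net_bytes / m.net_bw,
    }
    name = max(times, key=times.get)
    return name, times[name]


if __name__ == "__main__":
    machine = Machine(peak_flops=1e12, mem_bw=1e11, net_bw=1e10)
    kernel = Kernel(flops=1e9, mem_bytes=1e9, net_bytes=1e9)
    print(intensities(kernel))
    print(dominant_bound(machine, kernel))
```

Because $I_N = I_A \cdot I_M$, any two of the three intensities determine the third, which is consistent with the model fitting on a 2D plane.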

Abstract

In this short paper, we introduce the Ridgeline model, an extension of the Roofline model [4] for distributed systems. The Roofline model targets shared memory systems, bounding the performance of a kernel based on its operational intensity, and the peak compute throughput and memory bandwidth of the execution system. In a distributed setting, with multiple communicating compute entities, the network must be taken into account to model the system behavior accurately. The Ridgeline aggregates information on compute, memory, and network limits in one 2D plot to show, in an intuitive way, which of the resources is the expected bottleneck. We show the applicability of the Ridgeline in a case study based on a data-parallel Multi-Layer Perceptron (MLP) instance.
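To see why batch size matters in a data-parallel MLP, consider a rough sketch for a single dense layer. The assumptions here are illustrative and are not the paper's cost model: forward plus backward passes cost about $6 \cdot b \cdot n_{in} \cdot n_{out}$ FLOPs per node, gradients are stored in fp32, and a ring all-reduce sends about $2(p-1)/p$ times the gradient size per node. The function names are hypothetical.

```python
import math


def network_intensity(batch: int, n_in: int, n_out: int,
                      nodes: int, bytes_per_elem: int = 4) -> float:
    """FLOPs per network byte for one dense layer under the stated assumptions."""
    if nodes < 2:
        raise ValueError("data parallelism needs at least 2 nodes")
    # Forward matmul (2*b*n_in*n_out) plus two backward matmuls (4*b*n_in*n_out).
    flops = 6 * batch * n_in * n_out
    # Ring all-reduce of the weight gradients: ~2(p-1)/p of the gradient size per node.
    net_bytes = 2 * (nodes - 1) / nodes * n_in * n_out * bytes_per_elem
    return flops / net_bytes


def min_compute_bound_batch(peak_flops: float, net_bw: float,
                            nodes: int, bytes_per_elem: int = 4) -> int:
    """Smallest per-node batch at which the compute bound reaches the network bound."""
    # The layer dimensions cancel, so intensity is linear in the batch size.
    per_sample = network_intensity(1, 1, 1, nodes, bytes_per_elem)
    return math.ceil((peak_flops / net_bw) / per_sample)


if __name__ == "__main__":
    for b in (16, 64, 256):
        print(b, network_intensity(b, 1024, 1024, nodes=2))
    print(min_compute_bound_batch(peak_flops=3e12, net_bw=1e10, nodes=2))
```

Under these assumptions the network intensity grows linearly with the per-node batch size, so small batches tend to sit on the network-bound side of the balance line and large batches on the compute-bound side.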

Paper Structure

This paper contains 4 sections, 6 figures, 2 tables.

Introduction
The Ridgeline Model
Case Study
Conclusion

Figures (6)

Figure 1: A Naive 3D extension of Roofline.
Figure 2: Illustration of the Ridgeline principles.
Figure 3: The surface showing FLOPS as a function of network and memory intensities.
Figure 4: Roofline for an MLP instance on a Cascade Lake system. Analysis done on multiple points, corresponding to different batch sizes.
Figure 5: Individual layer of a Multi-Layer Perceptron network.
...and 1 more figure
