Rudder: Steering Prefetching in Distributed GNN Training using LLM Agents

Aishwarya Sarkar; Sayan Ghosh; Nathan Tallent; Aman Chadha; Tanya Roosta; Ali Jannesari

TL;DR

Rudder is a software module embedded in the state-of-the-art AWS DistDGL framework that autonomously prefetches remote nodes to minimize communication. It builds on the observation that contemporary Large Language Models (LLMs) exhibit emergent properties such as In-Context Learning (ICL) for zero-shot tasks, with logical multi-step reasoning.

Abstract

Large-scale Graph Neural Networks (GNNs) are typically trained by sampling a vertex's neighbors to a fixed distance. Because large input graphs are distributed, training requires frequent irregular communication that stalls forward progress. Moreover, the fetched data changes with the graph, the graph distribution, the sample and batch parameters, and the caching policies. Consequently, any static prefetching method will miss crucial opportunities to adapt to these dynamic conditions. In this paper, we introduce Rudder, a software module embedded in the state-of-the-art AWS DistDGL framework, to autonomously prefetch remote nodes and minimize communication. Rudder's adaptation contrasts with both standard heuristics and traditional ML classifiers. We observe that the generative AI found in contemporary Large Language Models (LLMs) exhibits emergent properties like In-Context Learning (ICL) for zero-shot tasks, with logical multi-step reasoning. We find this behavior well-suited for adaptive control even with substantial undertraining. Evaluations using standard datasets and unseen configurations on the NERSC Perlmutter supercomputer show up to 91% improvement in end-to-end training performance over baseline DistDGL (no prefetching), and an 82% improvement over static prefetching, reducing communication by over 50%. Our code is available at https://github.com/aishwaryyasarkar/rudder-llm-agent.
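
To make the idea of adaptive prefetching with replacement more concrete, here is a minimal sketch of a buffer of remote node IDs whose contents are refreshed according to a score that tracks recent usage. It is an illustration only, not Rudder's implementation and not part of the DistDGL API: the class `PrefetchBuffer`, its methods `lookup`, `replace`, and `hit_rate`, and the `decay` parameter are all hypothetical names chosen for this example. The caller decides when to invoke `replace`; in the setting described above, that decision is presumably what a controller (heuristic, ML classifier, or LLM agent) would make.

```python
from collections import Counter


class PrefetchBuffer:
    """Toy buffer of remote node IDs with score-based replacement (hypothetical)."""

    def __init__(self, capacity, decay=0.5):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.decay = decay       # assumed knob: how quickly old usage is forgotten
        self.scores = {}         # resident node ID -> recency-weighted score
        self.missed = Counter()  # non-resident node ID -> misses since last replace
        self.hits = 0
        self.lookups = 0

    def lookup(self, node_id):
        """Record one access and return True on a buffer hit."""
        self.lookups += 1
        if node_id in self.scores:
            self.scores[node_id] += 1.0
            self.hits += 1
            return True
        self.missed[node_id] += 1
        return False

    def replace(self):
        """Decay scores, then swap the coldest residents for the hottest misses."""
        for node_id in self.scores:
            self.scores[node_id] *= self.decay
        for node_id, count in self.missed.most_common():
            if len(self.scores) < self.capacity:
                self.scores[node_id] = float(count)
                continue
            coldest = min(self.scores, key=self.scores.get)
            if self.scores[coldest] >= count:
                break  # remaining misses are no hotter than any resident node
            del self.scores[coldest]
            self.scores[node_id] = float(count)
        self.missed.clear()

    def hit_rate(self):
        """Fraction of lookups served from the buffer so far."""
        return self.hits / self.lookups if self.lookups else 0.0


# Usage: feed sampled remote node IDs per minibatch, refreshing when desired.
buf = PrefetchBuffer(capacity=2)
for minibatch in ([1, 1, 2, 3], [1, 3, 3, 3]):
    for node in minibatch:
        buf.lookup(node)
    buf.replace()
print(sorted(buf.scores), buf.hit_rate())
```

A static policy would call `replace` on a fixed schedule. An adaptive controller could instead watch signals such as `hit_rate()` and trigger replacement only when it is likely to pay for its communication cost.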

Paper Structure (29 sections, 2 theorems, 21 figures, 5 tables, 1 algorithm)

Introduction
Background and Motivation
Prefetching and Replacement Strategies
Intelligent Prefetching Controllers
LLM agents vs. ML classifiers
Tradeoffs
LLM characteristics
Related Work
Enhancing GNN Training using Rudder
Tasks Creation and Deployment
Components of the Agentic Workflow
LLM Agent Decision Making
Decision Trajectory
Prompt Engineering
ML Classifier Decision Making
...and 14 more sections

Key Result

Corollary 1

For the set of nonzero observations $\mathbb{S}_{\neq 0}$ and computational resources $\mathbb{R} \propto \{\text{time}, \text{memory}\}$, $\mathbb{R}^{\mathbb{S}}_{SL} > \mathbb{R}^{\mathbb{S}}_{ICL}$, considering comparable test times (i.e., $T_{test(\Theta=SL)} \simeq T_{test(\Theta=ICL)}$).

Figures (21)

Figure 1: Declining unique remote nodes in GNN training.
Figure 2: Prefetching interfaces can range from simple to burdensome. The simplest designs sacrifice performance, while the most burdensome require enormous tuning. Rudder achieves high performance while requiring little tuning.
Figure 3: Adaptive replacement consistently yields the best %-Hits (higher is better) relative to other replacement strategies.
Figure 4: High-level aspects of our replacement strategy based on a scoring policy which tracks recent usage.
Figure 5: An LLM agent learns by interacting with its environment through auxiliary tools, whereas ML classifiers are trained offline.
...and 16 more figures

Theorems & Definitions (5)

Remark 1: Strong scaling
Remark 2: Diminishing overlap
Remark 3: Distribution shifts
Corollary 1: LLM agents require fewer resources for bootstrapping
Corollary 2: LLM agents are resilient to distribution shifts
