
Rudder: Steering Prefetching in Distributed GNN Training using LLM Agents

Aishwarya Sarkar, Sayan Ghosh, Nathan Tallent, Aman Chadha, Tanya Roosta, Ali Jannesari

TL;DR

Rudder is a software module embedded in the state-of-the-art AWS DistDGL framework that autonomously prefetches remote nodes to minimize communication. It builds on the observation that the generative AI found in contemporary Large Language Models (LLMs) exhibits emergent properties like In-Context Learning (ICL) for zero-shot tasks, with logical multi-step reasoning.

Abstract

Large-scale Graph Neural Networks (GNNs) are typically trained by sampling a vertex's neighbors to a fixed distance. Because large input graphs are distributed, training requires frequent irregular communication that stalls forward progress. Moreover, fetched data changes with graph, graph distribution, sample and batch parameters, and caching policies. Consequently, any static prefetching method will miss crucial opportunities to adapt to different dynamic conditions. In this paper, we introduce Rudder, a software module embedded in the state-of-the-art AWS DistDGL framework, to autonomously prefetch remote nodes and minimize communication. Rudder's adaptation contrasts with both standard heuristics and traditional ML classifiers. We observe that the generative AI found in contemporary Large Language Models (LLMs) exhibits emergent properties like In-Context Learning (ICL) for zero-shot tasks, with logical multi-step reasoning. We find this behavior well-suited for adaptive control even with substantial undertraining. Evaluations using standard datasets and unseen configurations on the NERSC Perlmutter supercomputer show up to 91% improvement in end-to-end training performance over baseline DistDGL (no prefetching), and an 82% improvement over static prefetching, reducing communication by over 50%. Our code is available at https://github.com/aishwaryyasarkar/rudder-llm-agent.
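
To make the source of the irregular communication concrete, the following is a minimal sketch of fixed-distance neighbor sampling on a partitioned graph. It is not taken from Rudder or DistDGL: the function names `sample_neighborhood` and `remote_nodes`, the toy ring graph, and the two-way partition are all hypothetical and chosen only for illustration.

```python
import random


def sample_neighborhood(adj, seeds, fanout, num_hops, rng):
    """Sample up to `fanout` neighbors per vertex, out to `num_hops` hops.

    Illustrative only; a real sampler operates on a distributed graph store.
    """
    frontier = set(seeds)
    visited = set(seeds)
    for _ in range(num_hops):
        next_frontier = set()
        for v in sorted(frontier):
            neighbors = adj.get(v, [])
            k = min(fanout, len(neighbors))
            next_frontier.update(rng.sample(neighbors, k))
        # Only expand vertices that have not been seen before.
        frontier = next_frontier - visited
        visited |= next_frontier
    return visited


def remote_nodes(nodes, owner, local_rank):
    """Return the sampled vertices owned by a different partition."""
    return {v for v in nodes if owner[v] != local_rank}


# Hypothetical toy input: an 8-vertex ring split across two partitions.
adj = {v: [(v - 1) % 8, (v + 1) % 8] for v in range(8)}
owner = {v: 0 if v < 4 else 1 for v in range(8)}

rng = random.Random(0)
sampled = sample_neighborhood(adj, seeds=[3], fanout=2, num_hops=2, rng=rng)
remote = remote_nodes(sampled, owner, local_rank=0)
print(sorted(sampled), sorted(remote))
```

In this toy setting, every vertex in `remote` would have to be fetched from another partition before the minibatch can be processed. Which vertices land in that set depends on the graph, the partitioning, the fanout, and the seeds, which is the variability the abstract attributes to fetched data.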

Paper Structure (29 sections, 2 theorems, 21 figures, 5 tables, 1 algorithm)

This paper contains 29 sections, 2 theorems, 21 figures, 5 tables, 1 algorithm.

Key Result

Corollary 1

For a nonzero observation set $\mathbb{S}_{\neq 0}$ and computational resources $\mathbb{R} \propto \{\text{time}, \text{memory}\}$, $\mathbb{R}^{\mathbb{S}}_{\text{SL}} > \mathbb{R}^{\mathbb{S}}_{\text{ICL}}$, considering comparable test times (i.e., $T_{\text{test}(\Theta=\text{SL})} \simeq T_{\text{test}(\Theta=\text{ICL})}$).

Figures (21)

  • Figure 1: Declining unique remote nodes in GNN training.
  • Figure 2: Prefetching interfaces can range from simple to burdensome. The simplest designs sacrifice performance, while the most burdensome require enormous tuning. Rudder achieves high performance while requiring little tuning.
  • Figure 3: Adaptive replacement consistently yields the best %-Hits (higher is better), relative to other replacement strategies.
  • Figure 4: High-level aspects of our replacement strategy based on a scoring policy which tracks recent usage.
  • Figure 5: An LLM agent learns by interacting with its environment through auxiliary tools, whereas ML classifiers are trained offline.
  • ...and 16 more figures
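
The captions for Figures 3 and 4 refer to a %-Hits metric and to a replacement strategy based on a scoring policy that tracks recent usage. The sketch below shows one simple way such a policy could look. It is an assumption for illustration, not Rudder's actual strategy: the class `ScoredBuffer`, the `decay` parameter, and the "evict the lowest score" rule are hypothetical.

```python
class ScoredBuffer:
    """Fixed-capacity buffer of remote node IDs with recency-weighted scores.

    Hypothetical illustration; not the policy implemented in Rudder.
    """

    def __init__(self, capacity, decay=0.5):
        self.capacity = capacity
        self.decay = decay
        self.scores = {}  # node id -> recency-weighted usage score
        self.hits = 0
        self.lookups = 0

    def access(self, batch):
        """Look up one minibatch of remote node IDs and update the buffer."""
        unique = list(dict.fromkeys(batch))  # drop duplicates, keep order
        # Age every resident node so that stale entries lose score.
        for v in self.scores:
            self.scores[v] *= self.decay
        misses = []
        for v in unique:
            self.lookups += 1
            if v in self.scores:
                self.hits += 1
                self.scores[v] += 1.0
            else:
                misses.append(v)
        # Admit missed nodes, evicting the lowest-scoring resident if full.
        for v in misses:
            if len(self.scores) >= self.capacity:
                victim = min(self.scores, key=self.scores.get)
                del self.scores[victim]
            self.scores[v] = 1.0

    def hit_rate(self):
        """Fraction of lookups served from the buffer (the %-Hits idea)."""
        return self.hits / self.lookups if self.lookups else 0.0


buf = ScoredBuffer(capacity=2)
buf.access([1, 2])  # both miss and are admitted
buf.access([1, 3])  # 1 hits; 3 misses and replaces the stale node 2
print(sorted(buf.scores), buf.hit_rate())
```

A node that keeps reappearing in minibatches accumulates score and stays resident, while a node that stops appearing decays and becomes the next eviction victim. Changing `decay` shifts the balance between favoring recent and long-term usage.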

Theorems & Definitions (5)

  • Remark 1: Strong scaling
  • Remark 2: Diminishing overlap
  • Remark 3: Distribution shifts
  • Corollary 1: LLM agents require less resources for bootstrapping
  • Corollary 2: LLM agents are resilient to distribution shifts