GPUnion: Autonomous GPU Sharing on Campus

Yufang Li, Yuanbo Zhang, Hanlong Liao, Deke Guo, Guoming Tang

TL;DR

GPUnion tackles campus-scale GPU sharing by combining containerized GPU execution, an autonomy-first provider model, and resilient, checkpoint-driven migration that tolerates voluntary participation. It achieves near-native performance across heterogeneous hardware while preserving provider control through kill-switches and fast migrations guided by application-level checkpoints. Case studies in a multi-server campus environment report a 30% improvement in GPU utilization, a 40% increase in interactive sessions, and a 94% migration success rate during provider departures, with minimal network overhead. Overall, GPUnion shows that trusted, lightweight, provider-driven resource pooling can democratize access to AI compute without centralized governance.

Abstract

A pronounced imbalance in GPU resources exists on campus: some laboratories own underutilized servers while others lack the compute needed for AI research. GPU sharing can alleviate this disparity, but existing platforms typically rely on centralized oversight and persistent allocation models, which conflict with the voluntary and autonomous nature of academic resource ownership. We present GPUnion, a campus-scale GPU sharing platform that enables voluntary participation while preserving full provider autonomy. GPUnion incorporates three core mechanisms: i) container-based task dispatching and execution, ii) a provider-first architecture, and iii) resilient execution featuring automatic checkpointing and migration. Case studies across multiple campus scenarios demonstrate a 30% improvement in GPU utilization, a 40% increase in interactive sessions, and a 94% success rate for workload migration during provider departures.

Paper Structure

This paper contains 15 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: GPUnion system architecture diagram.
  • Figure 2: Research group GPU utilization comparison.
  • Figure 3: Migration performance under different interruption scenarios.