Flex-MIG: Enabling Distributed Execution on MIG

Myeongsu Kim, Ikjun Yeom, Younghoon Kim

TL;DR

Flex-MIG reframes MIG from a rigid one-to-one hardware partitioning model into a software-managed one-to-many framework in which a single job can span multiple MIG leaves without draining or reconfiguring GPUs. It introduces two coordinated layers: an orchestration layer that schedules and places multiple MIG leaves per job using size- and topology-aware heuristics, and a runtime layer that enables Host Shared Memory (SHM) collectives across MIG instances by extending NCCL with MIG-aware peer discovery and synthetic Bus-ID labeling. The approach evens out resource utilization, reduces fragmentation, and avoids disruptive reconfiguration, improving makespan by up to 17% and raising cluster throughput in trace-driven simulations validated against real measurements. The results indicate that software-driven resource coordination can recover efficiency in multi-tenant GPU clusters while preserving hardware isolation, offering a scalable path to better MIG utilization for small-to-medium AI workloads in cloud and on-prem environments.

Abstract

GPU clusters in multi-tenant settings often suffer from underutilization, making GPU-sharing technologies essential for efficient resource use. Among them, NVIDIA Multi-Instance GPU (MIG) has gained traction for providing hardware-level isolation that enables concurrent workloads without interference. However, MIG's hardware rigidity and the conventional one-to-one allocation model jointly lead to severe fragmentation and cluster-wide underutilization. We present Flex-MIG, a software-only framework that replaces one-to-one with a one-to-many allocation model and enables host-shared-memory collectives across MIG instances without hardware modification. Flex-MIG eliminates drain-required reconfiguration, reduces fragmentation, and improves makespan by up to 17% across diverse traces, showing that rethinking MIG's operational model as a software-coordinated layer substantially improves cluster efficiency.

Paper Structure

This paper contains 31 sections, 18 figures, 3 tables.

Figures (18)

  • Figure 1: MIG architecture. MIG partitions a GPU into hardware-isolated instances with dedicated SM/L2/memory slices.
  • Figure 2: Coarse-grained MIG profiles and over-provisioning. The x-axis and y-axis represent the fractions of total compute units (SMs) and memory capacity (GB), respectively. Each rectangle denotes a fixed MIG profile. When a workload’s demand falls between profiles, it is rounded up to the nearest larger profile. The red shaded areas indicate the amount of excess compute or memory allocated in examples (A) and (B).
  • Figure 3: Fragmentation under MIG. (a) Internal fragmentation: Due to the tree-constrained layout, case (i) can be merged into 2g.10gb; case (ii) cannot be merged. (b) External fragmentation: instances on physically distinct GPUs cannot be merged into a larger instance.
  • Figure 4: Flex-MIG system overview. Flex-MIG operates over a three-layer cluster abstraction (user/application, orchestration, and runtime) to realize one-to-many execution using fixed, minimum-sized MIG leaves. The orchestration layer manages job queuing and instance allocation, while the runtime layer enables Host Shared Memory (SHM) collectives across MIG instances.
  • Figure 5: MIG-aware NCCL runtime workflow.
  • ...and 13 more figures
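
The captions for Figures 2 and 4 contrast two allocation models: rounding a demand up to the nearest larger MIG profile (one-to-one) versus covering it with several fixed, minimum-sized leaves (one-to-many). The sketch below is a rough illustration of that arithmetic only, not the paper's scheduler. The profile table is an assumption based on the A100 40GB profile set (the source names only 2g.10gb), the function names are hypothetical, and leaf resources are treated as simply additive, which ignores communication overhead and the fact that memory is not pooled across leaves.

```python
import math
from typing import NamedTuple, Optional


class Profile(NamedTuple):
    name: str
    compute_slices: int  # compute slices out of 7 (assumed A100 layout)
    memory_gb: int


# Assumed profile table, sorted from smallest to largest.
PROFILES = [
    Profile("1g.5gb", 1, 5),
    Profile("2g.10gb", 2, 10),
    Profile("3g.20gb", 3, 20),
    Profile("4g.20gb", 4, 20),
    Profile("7g.40gb", 7, 40),
]
LEAF = PROFILES[0]  # the minimum-sized leaf


def round_up_profile(compute: float, memory_gb: float) -> Optional[Profile]:
    """One-to-one: smallest single profile that covers both demands."""
    for p in PROFILES:
        if p.compute_slices >= compute and p.memory_gb >= memory_gb:
            return p
    return None  # demand exceeds a whole GPU


def leaves_needed(compute: float, memory_gb: float) -> int:
    """One-to-many: number of minimum-sized leaves covering both demands."""
    return max(
        math.ceil(compute / LEAF.compute_slices),
        math.ceil(memory_gb / LEAF.memory_gb),
    )


if __name__ == "__main__":
    compute, memory_gb = 3, 12  # hypothetical job demand
    single = round_up_profile(compute, memory_gb)
    n = leaves_needed(compute, memory_gb)
    print("one-to-one :", single.name,
          "excess memory:", single.memory_gb - memory_gb, "GB")
    print("one-to-many:", n, "x", LEAF.name,
          "excess memory:", n * LEAF.memory_gb - memory_gb, "GB")
```

For this hypothetical demand of 3 compute slices and 12 GB, the single-profile choice is 3g.20gb with 8 GB of unused memory, while three 1g.5gb leaves leave 3 GB unused. The leaves may also sit on different GPUs, which is the case the runtime layer's SHM collectives are meant to serve.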