AutoHete: An Automatic and Efficient Heterogeneous Training System for LLMs
Zihao Zeng, Chubo Liu, Xin He, Juan Hu, Yong Jiang, Fei Huang, Kenli Li, Wei Yang Bryan Lim
TL;DR
AutoHete tackles the GPU memory wall in LLM training by automatically coordinating activation checkpointing, parameter offloading, and optimizer offloading. It uses a two-stage workflow: a profiler and an ILP-based solver first derive an optimal memory-time plan at the transformer-block level, and a priority-based scheduler then overlaps operations across iterations. The framework achieves throughput gains of up to about 1.9× over state-of-the-art heterogeneous training systems in single- and multi-GPU settings, and it enables training larger models within existing hardware budgets. This approach broadens access to large-scale transformer training by easing CPU and memory bottlenecks without compromising convergence.
Abstract
Transformer-based large language models (LLMs) have demonstrated exceptional capabilities in sequence modeling and text generation, with improvements scaling proportionally with model size. However, the limitations of GPU memory have restricted LLM training accessibility for many researchers. Existing heterogeneous training methods significantly expand the scale of trainable models but introduce substantial communication overheads and CPU workloads. In this work, we propose AutoHete, an automatic and efficient heterogeneous training system compatible with both single-GPU and multi-GPU environments. AutoHete dynamically adjusts activation checkpointing, parameter offloading, and optimizer offloading based on the specific hardware configuration and LLM training needs. Additionally, we design a priority-based scheduling mechanism that maximizes the overlap between operations across training iterations, enhancing throughput. Compared to state-of-the-art heterogeneous training systems, AutoHete delivers a 1.32× to 1.91× throughput improvement across various model sizes and training configurations.
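
The memory-time trade-off behind choosing a strategy for each transformer block can be illustrated with a toy sketch. This is not the paper's formulation or implementation: exhaustive search stands in for a real solver, and the function name `plan_blocks`, the option names, and every cost value are hypothetical placeholders chosen only to make the example runnable.

```python
from itertools import product

# Hypothetical per-block options: (name, gpu_memory_cost, time_cost).
# All numbers are illustrative placeholders, not measurements.
OPTIONS = [
    ("keep", 4, 1),        # keep activations and parameters on the GPU
    ("checkpoint", 2, 3),  # drop activations and recompute them later
    ("offload", 1, 5),     # move the block's state to CPU memory
]


def plan_blocks(num_blocks, memory_budget):
    """Return (total_time, plan) with minimal time within the budget.

    Returns None when no assignment fits. The search is exhaustive, so it
    is only practical for a handful of blocks.
    """
    best = None
    for choice in product(OPTIONS, repeat=num_blocks):
        memory = sum(opt[1] for opt in choice)
        if memory > memory_budget:
            continue  # this assignment does not fit in GPU memory
        total_time = sum(opt[2] for opt in choice)
        if best is None or total_time < best[0]:
            best = (total_time, [opt[0] for opt in choice])
    return best
```

With a generous budget every block stays on the GPU, and as the budget shrinks the plan trades time for memory by checkpointing or offloading some blocks. A practical planner would rely on profiled costs and a proper solver rather than enumeration.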
