NNTile: a machine learning framework capable of training extremely large GPT language models on a single node
Aleksandr Mikhalev, Aleksandr Katrutsa, Konstantin Sozykin, Ivan Oseledets
TL;DR
The paper tackles the memory bottleneck and hardware underutilization that arise when training very large transformer models by introducing NNTile, a tile-based, task-based framework built on StarPU that dynamically schedules computations across CPUs and GPUs on a single node. It details the tile decomposition, data management, and scheduling strategies that enable efficient training of transformer components, including embedding, attention, and normalization layers, with standard optimizers such as SGD and Adam variants. Experimental results show that NNTile can train GPT2 models of up to 49.9B parameters on a single node, significantly surpassing the practical limits of PyTorch FSDP on the same hardware and illustrating the potential of automatic offloading to CPU RAM and dynamic task scheduling. The work demonstrates practical scalability gains for large language model training, and the framework is released as open source for broader adoption and further research.
Abstract
This study presents the NNTile framework for training large deep neural networks on heterogeneous clusters. NNTile is built on the StarPU library, which implements task-based parallelism and schedules all submitted tasks onto all available processing units (CPUs and GPUs). This means that any operation needed to train a large neural network can be performed on any CPU core or GPU device, depending on automatic scheduling decisions. Such an approach shifts the burden of deciding where to compute and when to communicate from a human to an automatic decision maker, whether a simple greedy heuristic or complex AI-based software. The performance of the presented tool for training large language models is demonstrated in extensive numerical experiments.
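
To make the idea of an automatic decision maker more concrete, the following is a minimal sketch of a simple greedy heuristic that assigns independent tile tasks to heterogeneous workers. It is an illustration only and does not use or reproduce the StarPU or NNTile API: the `Worker` class, the `greedy_schedule` and `makespan` functions, the relative speeds, and the task costs are all hypothetical, and data transfer costs and task dependencies are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """A processing unit (hypothetical model, not a StarPU object)."""
    name: str
    speed: float              # assumed relative throughput (work units per second)
    busy_until: float = 0.0   # time at which the worker becomes free
    assigned: list = field(default_factory=list)


def greedy_schedule(tasks, workers):
    """Give each (name, cost) task to the worker that would finish it earliest."""
    for name, cost in tasks:
        # Estimated finish time = current backlog + run time on that worker.
        best = min(workers, key=lambda w: w.busy_until + cost / w.speed)
        best.busy_until += cost / best.speed
        best.assigned.append(name)
    return workers


def makespan(workers):
    """Time at which the last worker finishes all of its tasks."""
    return max(w.busy_until for w in workers)


def example():
    # Assumed setup: one CPU core and one GPU that is 8x faster.
    workers = [Worker("cpu0", speed=1.0), Worker("gpu0", speed=8.0)]
    # Nine independent tile tasks of equal cost (e.g. tile matrix products).
    tasks = [(f"gemm_tile_{i}", 8.0) for i in range(9)]
    return greedy_schedule(tasks, workers)


if __name__ == "__main__":
    for w in example():
        print(w.name, len(w.assigned), "tasks, free at", w.busy_until)
```

In this toy setting the heuristic sends most tiles to the faster device but still uses the CPU once the GPU backlog grows, which is the kind of placement decision the framework delegates to the scheduler instead of the user. A real scheduler would also have to account for data movement between CPU RAM and GPU memory, which this sketch deliberately leaves out.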
