
CubicML: Automated ML for Large ML Systems Co-design with ML Prediction of Performance

Wei Wen, Quanyu Zhu, Weiwei Chu, Wen-Yen Chen, Jiyan Yang

TL;DR

CubicML uses an ML model as a proxy to predict training performance, which makes the search efficient and the performance modeling flexible. It is shown to effectively optimize the training speed of in-house ads recommendation models with 73 billion parameters and large language models with up to 405 billion parameters at Meta.

Abstract

Scaling up deep learning models has proven effective at improving the intelligence of machine learning (ML) models, especially industry recommendation models and large language models. The co-design of large distributed ML systems and algorithms (to maximize training performance) plays a pivotal role in this success. As these systems scale, the number of co-design hyper-parameters grows rapidly, which makes it challenging to find the optimal setup for maximizing system performance. In this paper, we propose CubicML, which uses ML to automatically optimize the training performance of large distributed ML systems. In CubicML, we use an ML model as a proxy to predict training performance, for search efficiency and performance modeling flexibility. We show that CubicML can effectively optimize the training speed of in-house ads recommendation models with 73 billion parameters and large language models with up to 405 billion parameters at Meta.
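The proxy idea in the abstract can be illustrated with a minimal sketch: score many candidate configurations with a cheap predictor and launch only the few that rank highest. Everything named here is an assumption made for illustration: the function `propose_top_k`, the search space keys, and the hand-written `toy_predictor` are hypothetical, and plain random sampling stands in for the paper's predictor-based RL searcher.

```python
import random

def propose_top_k(search_space, predictor, num_candidates, k, seed=0):
    """Sample candidate configs, rank them by predicted performance, keep the top k.

    Random sampling is a stand-in here; it is NOT the RL searcher from the paper.
    """
    rng = random.Random(seed)
    candidates = [
        {name: rng.choice(choices) for name, choices in search_space.items()}
        for _ in range(num_candidates)
    ]
    # Only the ordering matters, so the predictor's scores need not be calibrated.
    candidates.sort(key=predictor, reverse=True)
    return candidates[:k]

# Hypothetical co-design hyper-parameters (not taken from the paper).
search_space = {
    "sharding_strategy": ["full", "hybrid", "none"],
    "batch_size": [256, 512, 1024],
    "prefetch": [True, False],
}

def toy_predictor(cfg):
    """Hand-written stand-in for a trained performance predictor."""
    score = cfg["batch_size"] / 1024
    score += 0.5 if cfg["sharding_strategy"] == "hybrid" else 0.0
    score += 0.1 if cfg["prefetch"] else 0.0
    return score

top = propose_top_k(search_space, toy_predictor, num_candidates=200, k=3)
for cfg in top:
    print(round(toy_predictor(cfg), 3), cfg)
```

In a real loop, the measured performance of the launched jobs would be added to the training set and the predictor retrained before the next round; that feedback step is omitted here.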

Paper Structure (13 sections, 4 figures)

This paper contains 13 sections, 4 figures.

Figures (4)

  • Figure 1: CubicML framework overview.
  • Figure 2: (a) CubicML result (with the predictor-based RL searcher) when optimizing the QPS of an ads recommendation model by searching FSDP and other co-design hyper-parameters. x-axis: the number of configurations/jobs CubicML ran. y-axis: $90$-th percentile QPS. In the plots, QPS values are normalized/divided by a constant. Each color denotes one round of search: the dotted line shows the ground-truth QPS of each sample/configuration, and the solid line shows the maximal QPS frontier observed as the round proceeds. For the random search round, we average the maximal frontiers over $100$ perturbations. Note that in the "round 1" predictor-based RL search, we launched only a few top jobs for a quick test during development, resulting in a very short line. Jobs that failed because of out-of-memory errors or infra failures are not plotted. (b) Rank correlation between the ground-truth QPS and the QPS predicted by the "predictor". A validation dataset is used here. Note that we use a pairwise ranking loss to train the predictor, so the absolute values of the predicted QPS do not need to approximate the ground-truth QPS as long as the rank correlation is high.
  • Figure 3: Predicted WPS versus actual WPS for the predictor in CubicML. Left: random split of the dataset; middle: examples from older jobs are used to predict newer jobs; right: examples with fewer GPUs are used to predict larger-scale jobs with more GPUs. Note that all WPS values in each plot are normalized/divided by a constant.
  • Figure 4: Left: the number of GPUs used in jobs, sorted by timestamp. Note that timestamps are normalized/divided by a constant. Right: rank correlation versus the number of training examples under a random split of the dataset. Shaded bands are $\pm 2.0 \times$ the standard deviation over ten random perturbations.
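The Figure 2 caption notes that a predictor trained with a pairwise ranking loss only has to order configurations correctly, not reproduce absolute QPS. The sketch below illustrates that point under stated assumptions: the logistic form of the pairwise loss and the use of Spearman correlation are common choices picked for illustration, not details confirmed by the source, and the function names are hypothetical.

```python
import math

def pairwise_ranking_loss(pred, truth):
    """Mean logistic loss over ordered pairs (i, j) with truth[i] > truth[j].

    Assumed loss form: softplus(-(pred[i] - pred[j])).
    """
    total, count = 0.0, 0
    for i in range(len(pred)):
        for j in range(len(pred)):
            if truth[i] > truth[j]:
                x = -(pred[i] - pred[j])
                # Numerically stable softplus: log(1 + exp(x)).
                total += max(x, 0.0) + math.log1p(math.exp(-abs(x)))
                count += 1
    return total / count if count else 0.0

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / math.sqrt(va * vb)

# Made-up numbers: predictions are far from the truth in value but agree in order.
truth = [1.0, 2.0, 3.0, 4.0]
pred = [-5.0, 0.1, 0.2, 100.0]
print(spearman(truth, pred))
print(pairwise_ranking_loss(pred, truth), pairwise_ranking_loss(pred[::-1], truth))
```

Here the predicted values are badly miscalibrated, yet the rank correlation is perfect, and the loss is lower for the correctly ordered predictions than for the same values in reversed order.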