Training Overhead Ratio: A Practical Reliability Metric for Large Language Model Training Systems

Ning Lu, Qian Xie, Hao Zhang, Wenyi Fang, Yang Zheng, Zheng Hu, Jiantao Ma

TL;DR

Large Language Model training suffers from frequent failures during long, distributed runs, and existing reliability metrics fail to capture the resulting overhead. The authors introduce the Training Overhead Ratio (TOR), defined as $TOR = \frac{T_{opt}}{T_{obs}}$, the ratio of optimal to observed training time. It is paired with a performance preservation ratio $r(t) = w_{obs}(t)/W_{opt}$ and the integral relation $T_{opt} = \int_{0}^{T_{obs}} r(t) \, dt$, which together quantify reliability under faults. They model two failure types, fail-stop and fail-slow, as a repeating unit with stages (Slow Recovery, Healthy Run, Checkpoint Saving, Repair) and derive closed-form TOR expressions, interpreted in terms of mean time between failures (MTBF). These contributions let practitioners estimate realistic training durations, identify reliability bottlenecks, and compare fault-tolerant strategies for LLM training systems, which supports planning and cost management.
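As a rough illustration of the integral relation, the sketch below treats $r(t)$ as piecewise constant over the stages of one failure-repair unit, so the integral reduces to a sum of duration times ratio. The stage durations and ratio values are hypothetical numbers chosen for the example, not figures from the paper, and the function name `training_overhead_ratio` is likewise an assumption.

```python
from typing import List, Tuple

# One stage of a failure-repair unit: (name, duration in hours, ratio r).
# All durations and ratios used below are illustrative assumptions.
Stage = Tuple[str, float, float]


def training_overhead_ratio(stages: List[Stage]) -> float:
    """Return TOR = T_opt / T_obs for a piecewise-constant r(t).

    T_obs is the total observed time, and T_opt is the integral of r(t)
    over [0, T_obs], which for constant stages is sum(duration * ratio).
    """
    t_obs = sum(duration for _, duration, _ in stages)
    if t_obs <= 0:
        raise ValueError("total observed time must be positive")
    t_opt = sum(duration * ratio for _, duration, ratio in stages)
    return t_opt / t_obs


# Hypothetical unit: reduced progress while recovering, full progress in
# the healthy run, and no progress during checkpoint saving or repair.
example_unit = [
    ("slow_recovery", 0.5, 0.5),
    ("healthy_run", 8.0, 1.0),
    ("checkpoint_saving", 0.5, 0.0),
    ("repair", 1.0, 0.0),
]

tor = training_overhead_ratio(example_unit)
print(f"TOR = {tor:.3f}")
```

With these assumed numbers, 8.25 of the 10 observed hours count as optimal-equivalent progress. Which stages contribute partial or zero progress depends on the system being modeled, so the ratios should be replaced with measured values.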

Abstract

Large Language Models (LLMs) are revolutionizing the AI industry with their superior capabilities. Training these models requires large-scale GPU clusters and substantial computing time, leading to frequent failures that significantly increase training costs. Despite its significance, this field lacks a metric for evaluating reliability. In this work, we introduce a novel reliability metric called Training Overhead Ratio (TOR) to evaluate the reliability of fault-tolerant LLM training systems. TOR is defined as the ratio of optimal training time to the observed training time of a system, serving as a practical tool for users to estimate the actual time required to train an LLM on a given system. Furthermore, our investigation identifies the key factor for enhancing reliability and presents TOR equations for various types of failures encountered in practice.
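Because TOR is the ratio of optimal to observed training time, a known TOR can be inverted to estimate how long a run will actually take. The snippet below is a minimal sketch of that arithmetic. The function name, the 30-day optimal time, and the TOR of 0.8 are hypothetical, and the check that TOR lies in (0, 1] assumes a system never outperforms its optimal rate.

```python
def estimate_observed_time(t_opt: float, tor: float) -> float:
    """Estimate observed training time from T_obs = T_opt / TOR.

    t_opt: training time on a failure-free system, in any time unit.
    tor:   Training Overhead Ratio, assumed here to lie in (0, 1].
    """
    if t_opt < 0:
        raise ValueError("optimal time must be non-negative")
    if not 0.0 < tor <= 1.0:
        raise ValueError("TOR is assumed to lie in (0, 1]")
    return t_opt / tor


# Hypothetical example: a job that would take 30 days without failures,
# run on a system with a measured TOR of 0.8.
estimated_days = estimate_observed_time(30.0, 0.8)
print(f"Estimated training time: {estimated_days:.1f} days")
```

In this example the run is expected to take roughly 37.5 days, about 7.5 days more than the failure-free estimate.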

Paper Structure (6 sections, 5 equations, 1 figure)

Figures (1)

  • Figure 1: Changes in the performance preservation ratio $r(t)$ across the different states of one failure-repair period, for fail-stop (upper) and fail-slow (lower) failures.