Reducing Hyperparameter Tuning Costs in ML, Vision and Language Model Training Pipelines via Memoization-Awareness

Abdelmajid Essofi, Ridwan Salahuddeen, Munachiso Nwadike, Elnura Zhalieva, Kun Zhang, Eric Xing, Willie Neiswanger, Qirong Ho

TL;DR

This paper proposes EEIPU, a "memoization-aware" Bayesian Optimization (BO) algorithm that works in tandem with a pipeline caching system, allowing it to evaluate significantly more hyperparameter candidates per GPU-day than other tuning algorithms.

Abstract

The training or fine-tuning of machine learning, vision, and language models is often implemented as a pipeline: a sequence of stages encompassing data preparation, model training and evaluation. In this paper, we exploit pipeline structures to reduce the cost of hyperparameter tuning for model training/fine-tuning, which is particularly valuable for language models given their high costs in GPU-days. We propose a "memoization-aware" Bayesian Optimization (BO) algorithm, EEIPU, that works in tandem with a pipeline caching system, allowing it to evaluate significantly more hyperparameter candidates per GPU-day than other tuning algorithms. The result is better-quality hyperparameters in the same amount of search time, or equivalently, reduced search time to reach the same hyperparameter quality. In our benchmarks on machine learning (model ensembles), vision (convolutional architecture) and language (T5 architecture) pipelines, we compare EEIPU against recent BO algorithms: EEIPU produces an average of $103\%$ more hyperparameter candidates (within the same budget), and increases the validation metric by an average of $108\%$ more than other algorithms (where the increase is measured starting from the end of warm-up iterations).

Paper Structure

This paper contains 29 sections, 3 equations, 11 figures, 15 tables, 1 algorithm.

Figures (11)

  • Figure 1: Memoization-aware hyperparameter tuning, explained via a "breadcrumb" analogy. By caching pipeline stage outputs from earlier hyperparameter runs (the breadcrumbs), the cost of hyperparameter search on later stages is reduced if they reuse the earlier stage outputs. Our goal is to reduce the high cost of hyperparameter search in language, vision and ML pipelines; memoization-aware algorithms achieve this by reusing cached stage outputs, exploring later-stage hyperparameters at a fraction of their regular cost.
  • Figure 2: Our EEIPU algorithm plus memoization (caching). Our method fits $k+1$ GP models (one objective model and $k$ stage-wise cost models), given: (1) a set of $N$ observations (previous hyperparameter runs), (2) their corresponding objective values (model quality/validation scores) and observed runtime costs (on each of the $k$ pipeline stages), and (3) the remaining optimization budget. Each iteration of EEIPU selects a new hyperparameter candidate as follows: first, we randomly generate $M$ potential hyperparameter candidates $X^M$, preferring to re-use hyperparameter prefixes currently in the memoization cache. Then, we estimate these $M$ candidates' objective and stage-wise costs through MC sampling on the respective GP models. Memoized stages, if any, have their costs discarded by the memoization gate, leaving only non-memoized stages when computing the expected inverse cost $\mathbb{E}[1/C(x)]$. This expected inverse cost is raised to the power $\eta$ as a "cost-cooling" mechanism (Lee et al., 2020), which shifts attention towards high-objective exploitation over low-cost exploration as the search budget depletes. Multiplying each candidate's expected improvement by this cost-cooled expected inverse cost gives the EEIPU acquisition function, which we maximize (by simply ranking all $M$ candidates) to return the best hyperparameter candidate for evaluation. Finally, the cache is updated to hold the "hyperparameter prefix set" of the top-$5$ performing candidates seen so far, along with their stage-wise outputs $\mathbf{O}(x_{i,k}) \forall k \in \{1,2,...,K-1\}$. See the Methods section for details.
  • Figure 3: Prefix pooling for 1 observation (subfigure (a)) in a 3-stage pipeline. Each stage has a single parameter, respectively represented by $x_1,x_2$, and $x_3$, with corresponding costs $c_1,c_2,c_3$. Since the query in (a) is completed, both the cost and output value of each stage are stored. Subfigures (c) & (d) show how the first 2 stages of the observation are cached as prefixes to avoid rerunning them. "?" indicates un-memoized stages, and the empty prefix (subfigure (b)) is used for open-ended search. This process is independently applied to every observation chosen by the acquisition function.
  • Figure 4: Real pipelines, Top to Bottom: Stacking, Segmentation, and T5-small Pipelines. Left plots: The best objective value ($f(x^*)$) achieved by each method within the cumulative consumed budget, where a quicker progression to the top means a higher objective value achieved at a relatively lower cost. Middle plots: The best objective value achieved with respect to the iteration count. Right plots: The incurred cost per iteration.
  • Figure 5: Best-achieved objective value w.r.t. the chosen number of cached observations.
  • ...and 6 more figures
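
The prefix pooling of Figure 3 and the memoization gate of Figure 2 can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names (`prefix_pool`, `memoized_stage_count`, `expected_inverse_cost`, `eeipu_score`), the representation of a candidate as a tuple with one hyperparameter per stage, and the hard-coded cost samples are all hypothetical. The expected improvement is passed in as a plain number rather than computed from a GP, and the top-5 cache update is omitted.

```python
import numpy as np

def prefix_pool(x):
    """All proper prefixes of a completed observation, including the empty one.

    For a 3-stage observation (x1, x2, x3) this yields (), (x1,), (x1, x2),
    mirroring subfigures (b), (c) and (d) of Figure 3.
    """
    return [tuple(x[:i]) for i in range(len(x))]

def memoized_stage_count(candidate, cache):
    """Number of leading stages whose outputs can be read from the cache.

    Exact tuple matching is an assumption made for brevity; a real system
    would more likely key the cache on hashed stage configurations.
    """
    for m in range(len(candidate) - 1, 0, -1):
        if tuple(candidate[:m]) in cache:
            return m
    return 0

def expected_inverse_cost(candidate, cache, cost_samples):
    """Monte Carlo estimate of E[1/C(x)] over non-memoized stages only.

    cost_samples has shape (num_samples, num_stages) and stands in for
    draws from the stage-wise cost models. The final stage is never cached,
    so the remaining cost stays positive.
    """
    m = memoized_stage_count(candidate, cache)
    remaining = cost_samples[:, m:].sum(axis=1)  # memoization gate
    return float(np.mean(1.0 / remaining))

def eeipu_score(ei, candidate, cache, cost_samples, eta):
    """Expected improvement times the cost-cooled expected inverse cost."""
    return ei * expected_inverse_cost(candidate, cache, cost_samples) ** eta

# Hypothetical usage: one completed run populates the cache.
cache = set(prefix_pool((0.1, 0.5, 0.9)))
costs = np.array([[2.0, 3.0, 5.0], [2.0, 3.0, 5.0]])  # illustrative samples
reuses_prefix = eeipu_score(1.0, (0.1, 0.5, 0.2), cache, costs, eta=1.0)
from_scratch = eeipu_score(1.0, (0.3, 0.5, 0.2), cache, costs, eta=1.0)
print(reuses_prefix, from_scratch)
```

With equal expected improvement, the candidate that reuses the cached two-stage prefix pays only for the last stage and therefore scores higher. Setting `eta` to 0 removes the cost term entirely, which corresponds to the end of the budget, where the search favors the objective over low cost.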