
Faster Convergence for Transformer Fine-tuning with Line Search Methods

Philip Kenneweg, Leonardo Galli, Tristan Kenneweg, Barbara Hammer

TL;DR

This work combines the Armijo line search with the Adam optimizer and extends it by subdividing the network's architecture into sensible units and performing the line search separately on each of these local units.

Abstract

Recent works have shown that line search methods greatly increase the performance of traditional stochastic gradient descent methods on a variety of datasets and architectures [1], [2]. In this work we succeed in extending line search methods to the novel and highly popular Transformer architecture and to dataset domains in natural language processing. More specifically, we combine the Armijo line search with the Adam optimizer and extend it by subdividing the network's architecture into sensible units and performing the line search separately on each of these local units. Our optimization method outperforms the traditional Adam optimizer and achieves significant performance improvements for small datasets or small training budgets, while performing equally well or better in the other tested cases. Our work is publicly available as a Python package, which provides a hyperparameter-free PyTorch optimizer that is compatible with arbitrary network architectures.
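To make the Armijo condition mentioned in the abstract concrete, here is a minimal sketch of a backtracking line search on a toy quadratic. It is not the authors' package: it uses the plain gradient direction rather than an Adam direction, treats the whole parameter vector as a single unit, and the names `armijo_step`, `quadratic`, `quadratic_grad`, the constants `c` and `shrink`, and the step-doubling reset are all assumptions chosen for illustration.

```python
def armijo_step(loss_fn, params, grads, step, c=0.1, shrink=0.5, max_backtracks=50):
    """Shrink `step` until the Armijo sufficient-decrease condition holds:
    loss(params - step * grads) <= loss(params) - c * step * ||grads||^2
    """
    base = loss_fn(params)
    grad_sq = sum(g * g for g in grads)
    for _ in range(max_backtracks):
        candidate = [p - step * g for p, g in zip(params, grads)]
        if loss_fn(candidate) <= base - c * step * grad_sq:
            return candidate, step
        step *= shrink
    # No acceptable step found: leave the parameters unchanged.
    return params, 0.0


# Toy ill-conditioned quadratic standing in for a training loss (hypothetical).
CURVATURES = (1.0, 10.0)


def quadratic(w):
    return sum(0.5 * k * x * x for k, x in zip(CURVATURES, w))


def quadratic_grad(w):
    return [k * x for k, x in zip(CURVATURES, w)]


w = [1.0, 1.0]
step = 1.0
losses = [quadratic(w)]
for _ in range(20):
    w, accepted = armijo_step(quadratic, w, quadratic_grad(w), step)
    # Assumed heuristic: start the next search from twice the accepted step.
    step = accepted * 2.0 if accepted > 0 else 1.0
    losses.append(quadratic(w))
```

Because every accepted step satisfies the sufficient-decrease condition, the recorded losses never increase. Applying the same search separately to each unit of a network, as the abstract describes, would mean calling such a routine once per parameter group rather than once for the whole vector.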

Paper Structure (18 sections, 9 equations, 9 figures, 2 tables, 1 algorithm)


Figures (9)

  • Figure 1: Example of a problematic step size for component 9 of the network during a single training run on the QNLI dataset. The step size starts out on the order of $10^{-4}$ but is lowered to values below $10^{-50}$.
  • Figure 2: Different merging thresholds on one epoch of the MNLI dataset. Standard error is indicated around each line.
  • Figure 3: The loss curves of the experiments on the small dataset, with standard error indicated around each line. In all experiments SGDSLS performs worst, followed by ADAM. PLASLS and ADAMSLS do not differ much. In the SST2 and QNLI experiments PLASLS performs best, while in the MNLI experiment ADAMSLS performs best. In the MRPC experiments both perform about the same.
  • Figure 4: The accuracy curves of the experiments on the small dataset, with standard error indicated around each line, starting after the first epoch. In all experiments SGDSLS performs worst, followed by ADAM. PLASLS and ADAMSLS do not differ much. In the MNLI and MRPC experiments ADAMSLS performs best. In the SST2 and QNLI experiments ADAMSLS and PLASLS perform about the same.
  • Figure 5: The loss curves of the experiments on the full-size dataset, with standard error indicated around each line. Overall, SGDSLS clearly performs worst. In the SST2 experiments PLASLS fails to converge to a very low loss. In the MRPC experiment ADAMSLS and PLASLS perform better initially, but ADAM performs about the same in the end.
  • ...and 4 more figures