Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning

Bei Li, Tong Zheng, Rui Wang, Jiahao Liu, Qingyan Guo, Junliang Guo, Xu Tan, Tong Xiao, Jingbo Zhu, Jingang Wang, Xunliang Cai

TL;DR

A predictor-corrector learning framework to minimize truncation errors, which consists of a high-order predictor and a multistep corrector, and an exponential moving average-based coefficient learning method to strengthen the higher-order predictor.

Abstract

Residual networks, as discrete approximations of Ordinary Differential Equations (ODEs), have inspired significant advancements in neural network design, including multistep methods, high-order methods, and multi-particle dynamical systems. The precision of the solution to ODEs significantly affects parameter optimization, thereby impacting model performance. In this work, we present a series of advanced explorations of Transformer architecture design to minimize the error compared to the true "solution." First, we introduce a predictor-corrector learning framework to minimize truncation errors, which consists of a high-order predictor and a multistep corrector. Second, we propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor. Extensive experiments on large-scale machine translation, abstractive summarization, language modeling, and natural language understanding benchmarks demonstrate the superiority of our approach. On the WMT'14 English-German and English-French tasks, our model achieved BLEU scores of 30.95 and 44.27, respectively. Furthermore, on the OPUS multilingual machine translation task, our model surpasses a robust 3.8B DeepNet by an average of 2.9 SacreBLEU, using only 1/3 of the parameters. Notably, it also beats LLaMA models by 5.7 accuracy points on the LM Harness Evaluation.
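For readers unfamiliar with the numerical idea the abstract builds on, the sketch below shows a classical predictor-corrector step on a scalar ODE. It is not the paper's architecture: it assumes the textbook Heun scheme (an explicit Euler predictor followed by a trapezoidal corrector) rather than the high-order predictor and multistep corrector described above, and the function names and the test problem $dy/dt = -y$ are chosen purely for illustration.

```python
import math


def euler_step(f, t, y, h):
    """One explicit Euler step: first-order accurate."""
    return y + h * f(t, y)


def predictor_corrector_step(f, t, y, h):
    """One Heun step (illustrative stand-in for a predictor-corrector pair)."""
    # Predictor: a cheap explicit estimate of y at t + h.
    y_pred = y + h * f(t, y)
    # Corrector: trapezoidal rule, reusing the derivative at the predicted point.
    return y + 0.5 * h * (f(t, y) + f(t + h, y_pred))


def integrate(step, f, y0, t0, t1, n):
    """Apply `step` n times with a fixed step size from t0 to t1."""
    h = (t1 - t0) / n
    t, y = t0, y0
    for _ in range(n):
        y = step(f, t, y, h)
        t += h
    return y


def decay(t, y):
    """Right-hand side of dy/dt = -y, whose exact solution is exp(-t)."""
    return -y


exact = math.exp(-1.0)
euler_err = abs(integrate(euler_step, decay, 1.0, 0.0, 1.0, 10) - exact)
pc_err = abs(integrate(predictor_corrector_step, decay, 1.0, 0.0, 1.0, 10) - exact)
print(f"Euler error: {euler_err:.6f}, predictor-corrector error: {pc_err:.6f}")
```

With the same number of steps, the corrected update lands closer to the exact solution than the plain Euler update, which is the sense in which a predictor-corrector scheme reduces truncation error.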

Paper Structure

This paper contains 51 sections, 8 equations, 6 figures, 14 tables, 1 algorithm.

Figures (6)

  • Figure 1: Illustration of several advanced numerical methods and our proposed predictor-corrector paradigm. The right part shows a 4-order method used as the predictor to obtain $P_{t+1}$; $F_{t+1}$ is then estimated via a function $\mathcal{F}(\cdot)$; finally, a 4-step method is used as the corrector to obtain $y_{t+1}$.
  • Figure 2: Truncation errors with different intermediate approximations.
  • Figure 3: The comparison of BLEU as well as model capacities and training costs against previous state-of-the-art deep transformers.
  • Figure 4: The comparison of training and validation PPL on base and wide models.
  • Figure 5: The coefficient learning curves of independent initialization and EMA in both 2-order and 4-order scenarios. The experiments are conducted on WMT En-De.
  • ...and 1 more figure