On Exact Bit-level Reversible Transformers Without Changing Architectures

Guoqiang Zhang, J. P. Lewis, W. B. Kleijn

TL;DR

BDIA-transformer enables exact bit-level reversibility for standard transformer architectures by training with per-block random γ values in {0.5, -0.5}, which regularize the model as an ensemble of ODE solvers. Inference uses E[γ] = 0, recovering the conventional transformer update, while activation quantization and lightweight side information ensure lossless online back-propagation. The approach yields improved validation performance and substantially lower training memory across image classification, machine translation, and language modeling tasks. This combination of BDIA-based training and quantization extends reversible DNN concepts to standard transformer architectures with practical memory savings and generalization benefits.

Abstract

Various reversible deep neural network (DNN) models have been proposed to reduce memory consumption during training. However, almost all existing reversible DNNs either require special non-standard architectures or are constructed by considerably modifying existing DNN architectures to enable reversibility. In this work we present the BDIA-transformer, an exact bit-level reversible transformer that uses an unchanged standard architecture for inference. The basic idea is to first treat each transformer block as an Euler integration approximation for solving an ordinary differential equation (ODE), and then to incorporate the technique of bidirectional integration approximation (BDIA) into the neural architecture, together with activation quantization, to make it exactly bit-level reversible. During training, a hyper-parameter $\gamma$ in the BDIA-transformer randomly takes one of the two values $\{0.5, -0.5\}$ per training sample per transformer block, and is used to average every two consecutive integration approximations. As a result, the BDIA-transformer can be viewed as training an ensemble of ODE solvers parameterized by a set of binary random variables, which regularizes the model and improves validation accuracy. To enable exact bit-level reversibility, lightweight side information must be stored per transformer block in the forward pass to account for binary quantization loss. At inference, the expectation $\mathbb{E}(\gamma)=0$ is used, which makes the resulting architecture of the BDIA-transformer identical to a standard transformer up to activation quantization. Our experiments in both image classification and language translation show that BDIA-transformers significantly outperform their conventional counterparts in validation performance while also requiring considerably less training memory.
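To make the mechanism in the abstract concrete, here is a minimal NumPy sketch of one way such a reversible update could look. Everything specific in it is an assumption for illustration and is not stated in the abstract: the update form $x_{k+1} = \gamma x_{k-1} + (1-\gamma)x_k + (1+\gamma)h_k(x_k)$ (which reduces to the usual residual step $x_k + h_k(x_k)$ at $\gamma = 0$), the fixed-point precision `L_BITS`, the toy residual branch `make_block`, and all function names. Activations are held as integer multiples of $2^{-l}$; multiplying $x_{k-1}$ by $\pm 0.5$ would drop its lowest bit, so that parity bit is kept as side information.

```python
import numpy as np

L_BITS = 9                      # hypothetical precision: activations are multiples of 2**-L_BITS
SCALE = 2.0 ** -L_BITS

def make_block(d, seed):
    """Toy stand-in for a block's residual branch h_k (not real attention/MLP)."""
    w = np.random.default_rng(seed).normal(scale=d ** -0.5, size=(d, d))
    return lambda x: np.tanh(x @ w)

def quantize(y):
    """Round a float array to the fixed-point grid; returns int64 grid indices."""
    return np.rint(y / SCALE).astype(np.int64)

def mixed_term(n_k, gamma, h):
    """Q[(1 - gamma) * x_k + (1 + gamma) * h(x_k)]; recomputable from n_k alone."""
    x_k = n_k * SCALE
    return quantize((1.0 - gamma) * x_k + (1.0 + gamma) * h(x_k))

def forward(n0, blocks, rng):
    """Run all blocks, keeping only the last two states, gamma signs and parity bits."""
    n_prev = n0
    n_cur = n0 + quantize(blocks[0](n0 * SCALE))        # first step: plain Euler
    signs, parities = [], []
    for h in blocks[1:]:
        # gamma = 0.5 * sign, drawn per sample (row) for this block
        sign = rng.choice(np.array([-1, 1]), size=(n0.shape[0], 1))
        s = n_prev & 1                                  # 1-bit side info: parity of x_{k-1}
        half = (n_prev + s) // 2                        # exact, because n_prev + s is even
        n_next = sign * half + mixed_term(n_cur, 0.5 * sign, h)
        signs.append(sign)
        parities.append(s)
        n_prev, n_cur = n_cur, n_next
    return n_prev, n_cur, signs, parities

def reverse(n_prev, n_cur, blocks, signs, parities):
    """Recover x_0 from the last two states by undoing each block in integer arithmetic."""
    for h, sign, s in zip(blocks[:0:-1], signs[::-1], parities[::-1]):
        # On entry n_prev holds x_k and n_cur holds x_{k+1}; 1/gamma equals 2 * sign.
        n_older = 2 * sign * (n_cur - mixed_term(n_prev, 0.5 * sign, h)) - s
        n_prev, n_cur = n_older, n_prev
    return n_prev

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    blocks = [make_block(8, seed=k) for k in range(12)]
    n0 = quantize(rng.normal(size=(4, 8)))
    last_two_and_side_info = forward(n0, blocks, rng)
    print(np.array_equal(reverse(*last_two_and_side_info, blocks), n0))
```

In this sketch the backward pass needs only the final two activations, one sign per sample per block, and one parity bit per activation per block; every intermediate activation is reconstructed exactly because the reverse step uses integer operations plus a deterministic recomputation of the quantized term.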

Paper Structure (12 sections, 16 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Validation performance of different ODE solvers parameterized by a single $\gamma$ parameter after training ViT and BDIA-ViT over CIFAR10. See Subsection \ref{subsec:exp_vit} for how ViT and BDIA-ViT were trained. Each ODE solver in the inference procedure is realized by selecting $\gamma$ from $[-0.5, 0.5]$, which is fixed across all the transformer blocks for the same input image. The validation performance of BDIA-ViT is more robust than that of ViT.
  • Figure 2: Demonstration of the accumulated reconstruction error by following (\ref{equ:BDIA_reverse}) with the setup $\gamma_k\in\{0.5, -0.5\}$, $k=1,\ldots, K{-}1$, when training BDIA-GPT2 with 12 transformer blocks.
  • Figure 3: Performance comparison of ViT, RevViT \cite{Mangalam23ReverViT}, and BDIA-ViT for image classification over CIFAR10 and CIFAR100. $\{\gamma_k\}_{k=1}^{K-1}$ in the training procedure of BDIA-ViT were randomly drawn from $\{\pm0.5\}$ per training sample.
  • Figure 4: Performance comparison for English to French translation. $\{\gamma_k\}_{k=1}^{K-1}$ in the training procedure of BDIA-transformer were randomly drawn from $\{\pm0.5\}$ per training sample.
  • Figure 5: Performance comparison when training GPT2. $\{\gamma_k\}_{k=1}^{K-1}$ in the training procedure of BDIA-GPT2 were randomly drawn from $\{\pm0.5\}$ per training sample.

Theorems & Definitions (2)

  • Remark 1
  • Remark 2