BiPFT: Binary Pre-trained Foundation Transformer with Low-rank Estimation of Binarization Residual Polynomials

Xingrun Xing; Li Du; Xinyuan Wang; Xianlin Zeng; Yequan Wang; Zheng Zhang; Jiajun Zhang

TL;DR

This work introduces BiPFT, the first binary pretrained foundation transformer for natural language understanding. It addresses the computational and memory bottlenecks of large full-precision (FP) models by pretraining and finetuning entirely in binary. It combines a strong binary baseline with data-driven binarization of self-attention: the binarization error is expressed as binarization residual polynomials, which are modeled by low-rank estimators to enable effective 1-bit representations. BiPFT-A achieves a substantial average GLUE gain (13.9%) over a binary baseline, and BiPFT-B improves on it by a further 1.6% through residual polynomial estimation, for a reported total average improvement of 15.4%. The model also brings large efficiency benefits ($56\times$ fewer operations, $28\times$ less memory) and reduces dependence on distillation. The results show that pretraining binary transformers yields robust, task-agnostic knowledge transfer to downstream NLU tasks, with possible future extensions to binary NLP generation tasks.

Abstract

Pretrained foundation models offer substantial benefits for a wide range of downstream tasks and are among the most promising techniques for approaching artificial general intelligence. However, scaling up foundation transformers for maximal task-agnostic knowledge brings computational challenges, especially on resource-limited devices such as mobile phones. This work proposes the first Binary Pretrained Foundation Transformer (BiPFT) for natural language understanding (NLU) tasks, which reduces operations by a factor of 56 and memory by a factor of 28. In contrast to previous task-specific binary transformers, BiPFT substantially enhances the learning capabilities of binary neural networks (BNNs), bringing BNNs into the era of pretraining. Benefiting from extensive pretraining data, we further propose a data-driven binarization method. Specifically, we first analyze the binarization error in self-attention operations and derive the polynomials of this error. To simulate full-precision self-attention, we define the binarization error as binarization residual polynomials and then introduce low-rank estimators to model these polynomials. Extensive experiments validate the effectiveness of BiPFTs, which surpass the task-specific baseline by 15.4% in average performance on the GLUE benchmark. BiPFT also demonstrates improved robustness to hyperparameter changes, improved optimization efficiency, and reduced reliance on downstream distillation; as a result, it generalizes across various NLU tasks and simplifies the downstream pipeline of BNNs. Our code and pretrained models are publicly available at https://github.com/Xingrun-Xing/BiPFT.
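To make the idea of a binarization residual concrete, the following is a minimal sketch and not the authors' implementation. It assumes a common sign-plus-scale binarizer (scale = mean absolute value), which the abstract does not specify, and it uses hypothetical helper names (`binarize`, `matmul_t`, `add`). Writing $Q = Q_b + \Delta Q$ and $K = K_b + \Delta K$, the attention scores expand as $QK^\top = Q_bK_b^\top + Q_b\Delta K^\top + \Delta Q K_b^\top + \Delta Q\Delta K^\top$; the last three terms are the error that a purely binary product drops. The sketch only checks this identity numerically; the low-rank estimators that the paper trains to model such terms are not reproduced here.

```python
import random

def sign(x):
    # Map to {-1, +1}; zero is sent to +1 by convention in this sketch.
    return 1.0 if x >= 0 else -1.0

def binarize(mat):
    """Return (alpha * sign(mat), residual), with alpha = mean |entry| (an assumption)."""
    flat = [abs(v) for row in mat for v in row]
    alpha = sum(flat) / len(flat)
    binary = [[alpha * sign(v) for v in row] for row in mat]
    residual = [[v - b for v, b in zip(row, brow)] for row, brow in zip(mat, binary)]
    return binary, residual

def matmul_t(a, b):
    """Compute a @ b^T for row-major nested lists."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]

def add(*mats):
    """Element-wise sum of equally shaped matrices."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*mats)]

def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, rb_rows(b)) for x, y in zip(ra, rb))

def rb_rows(b):
    # Trivial helper so max_abs_diff reads as a row-wise comparison.
    return b

random.seed(0)
n, d = 4, 8  # toy sequence length and head dimension
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

Qb, dQ = binarize(Q)
Kb, dK = binarize(K)

full = matmul_t(Q, K)            # full-precision attention scores
binary_only = matmul_t(Qb, Kb)   # what a purely binary product computes
residual = add(matmul_t(Qb, dK), matmul_t(dQ, Kb), matmul_t(dQ, dK))

identity_gap = max_abs_diff(full, add(binary_only, residual))
binary_gap = max_abs_diff(full, binary_only)
print("gap with residual terms:", identity_gap)
print("gap without residual terms:", binary_gap)
```

The first gap is zero up to floating-point rounding, while the second is the error a binary-only product incurs; this is the quantity that a learned estimator would need to approximate cheaply.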

Paper Structure (16 sections, 18 equations, 4 figures, 8 tables)

Introduction
Related Work
Methodology
Build Binary Baseline Architecture
Pretrain Binary Transformers
Estimate Binarization Polynomials
Experiments
Experiment Settings
Main Results
Ablation Studies
Conclusion
Acknowledgments
Appendix
A. Discussion for Baseline Settings
B. Robustness of Binary Transformers
...and 1 more sections

Figures (4)

Figure 1: Comparison of training pipelines for binary transformers. FP indicates full-precision. For downstream tasks, finetuning BiPFT replaces previous task-specific pipelines.
Figure 2: Comparison of BiPFT-A and baselines at different batch sizes. Top: baseline with task-specific distillation; bottom: baseline without task-specific distillation.
Figure 3: Pretraining performance at different training steps.
Figure 4: Comparison of BiPFT-A and baselines at different learning rates. Top: baseline with task-specific distillation; bottom: baseline without task-specific distillation. We set the base learning rates for the baselines according to the per-task search results of BiTs; we choose learning rates for BiPFT-A from $\{5\times10^{-6}, 1\times10^{-5}, 2\times10^{-5}\}$.
