LoGAH: Predicting 774-Million-Parameter Transformers using Graph HyperNetworks with 1/100 Parameters

Xinyu Zhou; Boris Knyazev; Alexia Jolicoeur-Martineau; Jie Fu

TL;DR

LoGAH tackles the high cost of pretraining large transformer models by predicting initialization parameters with a memory-efficient, low-rank Graph HyperNetwork (GHN) decoder. It overcomes the parameter-copy bottleneck of prior GHNs, enabling parameter prediction for significantly wider networks with a decoder whose parameter count scales as $O(d^2)$ rather than $O(d^3)$. Empirically, LoGAH improves initializations for ViT and GPT-2 over random starts and GHN-3, and exhibits transfer learning capabilities across datasets, including enabling a 2.5M-parameter LoGAH to seed a 774M-parameter GPT-2-Large. This work demonstrates practical potential for reducing pretraining costs while maintaining or improving downstream performance in vision and language transformers.
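To make the $O(d^3)$ versus $O(d^2)$ contrast concrete, the following is a minimal counting sketch, not the authors' decoder. It assumes a single linear output head whose input size grows linearly with the width $d$ (here `hidden = 8 * d`, an illustrative choice) and a fixed rank `r`, which is a simplification and not necessarily the paper's setting. The function names are hypothetical.

```python
def full_head_params(hidden: int, d: int) -> int:
    """Weights of a linear layer mapping a `hidden`-dim embedding
    to a flattened d x d weight matrix (bias ignored)."""
    return hidden * d * d


def low_rank_head_params(hidden: int, d: int, r: int) -> int:
    """Weights of a linear layer that instead outputs two factors,
    A (d x r) and B (r x d), so that W is recovered as A @ B (bias ignored)."""
    return hidden * (d * r + r * d)


RANK = 32  # assumption: fixed rank, chosen only for illustration

rows = []
for d in (64, 128, 256, 512):
    hidden = 8 * d  # assumption: head input size grows linearly with d
    rows.append((d, full_head_params(hidden, d), low_rank_head_params(hidden, d, RANK)))

for d, full, low in rows:
    print(f"d={d:4d}  full head: {full:>13,d}  low-rank head: {low:>11,d}")
```

Under these assumptions, doubling $d$ multiplies the full head's size by 8 (cubic growth) but the low-rank head's size by only 4 (quadratic growth).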

Abstract

A good initialization of deep learning models is essential since it can help them converge better and faster. However, pretraining large models is unaffordable for many researchers, which makes predicting good initial parameters increasingly valuable. Graph HyperNetworks (GHNs), one approach to predicting model parameters, have recently shown strong performance in initializing large vision models. Unfortunately, predicting parameters of very wide networks relies on copying small chunks of parameters multiple times, and supporting full prediction requires an extremely large number of parameters, which greatly hinders adoption in practice. To address this limitation, we propose LoGAH (Low-rank GrAph Hypernetworks), a GHN with a low-rank parameter decoder that extends to significantly wider networks without the excessive increase in parameters required by previous attempts. LoGAH allows us to predict the parameters of neural networks with 774 million parameters in a memory-efficient manner. We show that vision and language models (i.e., ViT and GPT-2) initialized with LoGAH achieve better performance than those initialized randomly or using existing hypernetworks. Furthermore, we show promising transfer learning results when training LoGAH on small datasets and using the predicted parameters to initialize models for larger tasks. Our code is available at https://github.com/Blackzxy/LoGAH .

Paper Structure (24 sections, 10 equations, 8 figures, 7 tables)

Introduction
Preliminaries
Graph HyperNetworks
GHN Decoder
Scalable Graph HyperNetworks: LoGAH
Low-Rank Decoder
Predicting parameters in larger shapes with fewer parameters
ViTs-1K and GPTs-1K Datasets
Experiments
ViT Experiments
Overall Comparison on CIFAR-10, CIFAR-100 and ImageNet
Effect of meta-batch size $m$ on LoGAH
GPT-2 Experiments
Qualitative Analysis
Transfer Learning Experiments
...and 9 more sections

Figures (8)

Figure 1: Comparison of parameter counts between GHN-3 and LoGAH. GHN-3 requires a larger hidden size to support wider networks, which increases the size of GHN-3 exponentially (panel labeled param_width). LoGAH can support much wider networks (up to 2048 dimensions) and larger networks (GPT-2-Large, with hidden size 1280 and 774M parameters), even when using LoGAH-Tiny.
Figure 2: CIFAR-10 top-1 accuracy (%) on ViT-Small and ViT-Base where LoGAH is trained with different meta-batch sizes $m$.
Figure 3: GPT-2 transfer learning experiments. LoGAH is trained on WikiText-2, and GPT-2 models initialized with LoGAH's predicted parameters are fine-tuned on WikiText-103.
Figure 4: ViT transfer learning experiments. We use LoGAH trained on CIFAR-10 (resp. CIFAR-100) to predict ViT's parameters, then ViT is trained on CIFAR-100 (resp. ImageNet). T, S, B and L denote the Tiny, Small, Base and Large versions of LoGAH, respectively.
Figure 5: Code for Low-rank decoder in LoGAH.
...and 3 more figures
