Investigating the Effectiveness of HyperTuning via Gisting

Jason Phang

TL;DR

This work introduces hypertuning, a paradigm in which a hypermodel generates task-specific PEFT parameters for a frozen downstream Transformer, enabling adaptation from few-shot inputs without backpropagating through the LM. The authors instantiate this with HyperT5-Prefix and HyperT5-LoRA, training in two stages: hyperpretraining using a CACLM objective and multi-task fine-tuning to generalize to unseen tasks. Across P3, MetaICL, and S-NI, hypermodels produce competitive PEFT parameters that improve over standard PEFT baselines and can serve as strong initializations for subsequent fine-tuning, though they generally lag full attention-based few-shot or multi-task fine-tuned models. The results suggest hypertuning is economical and practical for rapid task adaptation, with potential for improved parameter initialization and faster convergence, while highlighting areas for future gains, such as closer performance to full-scale few-shot models and optimized hyperpretraining strategies.

Abstract

Gisting (Mu et al., 2023) is a simple method for training models to compress information into fewer token representations using a modified attention mask, and can serve as an economical approach to training Transformer-based hypernetworks. We introduce HyperLlama, a set of Gisting-based hypernetworks built on Llama-2 models that generates task-specific soft prefixes based on few-shot inputs. In experiments across P3, Super-NaturalInstructions and Symbol Tuning datasets, we show that HyperLlama models can effectively compress information from few-shot examples into soft prefixes. However, they still underperform multi-task fine-tuned language models with full attention over few-shot in-context examples. We also show that HyperLlama-generated soft prefixes can serve as better initializations for further prefix tuning. Overall, Gisting-based hypernetworks are economical and easy to implement, but have mixed empirical performance.
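The "modified attention mask" mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a decoder-only layout of [context | gist | target], where the context holds the few-shot examples, and it assumes the mask is causal with one extra rule, namely that target positions may not attend to the raw context and so must rely on the gist positions. The function name and arguments are hypothetical.

```python
def gist_attention_mask(n_context, n_gist, n_target):
    """Build a boolean mask where mask[q][k] is True if query q may attend to key k.

    Assumed position layout: [context | gist | target].
    """
    total = n_context + n_gist + n_target
    gist_start = n_context
    target_start = n_context + n_gist
    mask = []
    for q in range(total):
        row = []
        for k in range(total):
            # Standard causal rule: a position sees itself and earlier positions.
            causal = k <= q
            # Gisting rule (assumed): target positions cannot see the raw context,
            # so any information they need must pass through the gist positions.
            blocked = q >= target_start and k < gist_start
            row.append(causal and not blocked)
        mask.append(row)
    return mask


mask = gist_attention_mask(n_context=3, n_gist=2, n_target=2)
for row in mask:
    # "x" marks an allowed attention edge, "." a masked one.
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the gist rows still see the context columns, while the final target rows see only the gist columns and earlier target positions. Under this assumption, the hidden states at the gist positions play the role of the compressed soft prefix.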

Paper Structure (33 sections, 4 equations, 9 figures)

Introduction
Related Work
HyperNetworks
Multi-task Training and Transfer
HyperTuning
1) Large language models can perform in-context learning effectively.
2) Large language models can be adapted to downstream tasks by tuning a small set of parameters.
HyperTuning with Fewshot Examples
HyperT5: A T5-Based HyperModel
Architecture and Setup
HyperPretraining
Multi-Task Fine-Tuning with HyperT5
Multitask Fine-Tuning (MTF)
Datasets
Results
...and 18 more sections

Figures (9)

Figure 1: Overview of HyperTuning. (A) Fine-tuning, where all model parameters are updated (red). (B) Parameter-efficient fine-tuning (PEFT), where all model parameters are frozen (blue) and only a small number of parameters, $\phi$, are updated. (C) HyperTuning, where a hypermodel is used to generate parameters $\phi$ for a frozen downstream model. For instance, a hypermodel may take a set of few-shot examples to determine what $\phi$ to generate. Only the hypermodel's parameters are updated during training. (D) At inference time, the parameters $\phi$ need to be generated only once; thereafter only $\phi$ needs to be stored, with no need to retain the few-shot examples.
Figure 2: Overview of HyperT5. (A) HyperT5 takes as input few-shot examples and outputs PEFT parameters $\phi$. The model is initialized from an LM-adapted T5. (B) In HyperT5-Prefix, $\phi$ are key and value prefixes for every attention layer. (C) In HyperT5-LoRA, $\phi$ are additive low-rank modifications to the query and value linear maps.
Figure 3: Overview of HyperPretraining using the Context-Augmented Conditional Language Modeling (CACLM) objective to train a hypermodel to predict PEFT parameters $\phi$. (A) Sample a sequence of 512 tokens from a pretraining corpus and split it into 4 segments A--D. (B) The frozen downstream model takes as input B and predicts continuation C. (C) The hypermodel is trained to encode additional context A and D into PEFT parameters $\phi$, providing additional information to the downstream model to predict C.
Figure 4: Performance of HyperT5 models on P3 evaluation with different amounts of hyperpretraining. HyperPretraining is crucial for good performance of the hypermodels. However, hyperpretraining for too many steps can also hurt performance (as seen in the case of HyperT5-LoRA).
Figure 5: Average performance on P3 held-out tasks with prefix tuning and LoRA, using different parameter initializations. Using hypermodel-generated initializations starts with higher performance and continues to perform better on average over the course of training.
...and 4 more figures
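
The data preparation described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the caption says only that a 512-token sequence is divided into four segments A to D, so the uniformly random cut points, the function name, and the returned dictionary layout are all assumptions made for the example.

```python
import random


def splice_caclm(tokens, rng):
    """Split a token sequence into four contiguous, non-empty segments A, B, C, D.

    Assumption: the three cut points are drawn uniformly at random; the source
    does not state how segment lengths are chosen.
    """
    n = len(tokens)
    if n < 4:
        raise ValueError("need at least 4 tokens to form 4 non-empty segments")
    # Three distinct cut points in 1..n-1 guarantee four non-empty segments.
    c1, c2, c3 = sorted(rng.sample(range(1, n), 3))
    a, b, c, d = tokens[:c1], tokens[c1:c2], tokens[c2:c3], tokens[c3:]
    return {
        "hypermodel_context": (a, d),  # A and D are encoded into PEFT parameters
        "downstream_input": b,         # B is fed to the frozen downstream model
        "target": c,                   # C is the continuation to be predicted
    }


rng = random.Random(0)
tokens = list(range(512))  # stand-in token ids for a 512-token sample
example = splice_caclm(tokens, rng)
a, d = example["hypermodel_context"]
print(len(a), len(example["downstream_input"]), len(example["target"]), len(d))
```

The segments are contiguous and in order, so concatenating A, B, C, and D recovers the original sequence. The hypermodel sees only the text surrounding the B-to-C span, which is what forces it to encode useful context into $\phi$.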
