Investigating the Effectiveness of HyperTuning via Gisting

Jason Phang

TL;DR

This work introduces hypertuning, a paradigm in which a hypermodel generates task-specific PEFT parameters for a frozen downstream Transformer, enabling adaptation from few-shot inputs without backpropagating through the LM. The authors instantiate this with HyperT5-Prefix and HyperT5-LoRA, training in two stages: hyperpretraining using a CACLM objective and multi-task fine-tuning to generalize to unseen tasks. Across P3, MetaICL, and S-NI, hypermodels produce competitive PEFT parameters that improve over standard PEFT baselines and can serve as strong initializations for subsequent fine-tuning, though they generally lag full attention-based few-shot or multi-task fine-tuned models. The results suggest hypertuning is economical and practical for rapid task adaptation, with potential for improved parameter initialization and faster convergence, while highlighting areas for future gains, such as closer performance to full-scale few-shot models and optimized hyperpretraining strategies.

Abstract

Gisting (Mu et al., 2023) is a simple method for training models to compress information into fewer token representations using a modified attention mask, and can serve as an economical approach to training Transformer-based hypernetworks. We introduce HyperLlama, a set of Gisting-based hypernetworks built on Llama-2 models that generates task-specific soft prefixes based on few-shot inputs. In experiments across P3, Super-NaturalInstructions and Symbol Tuning datasets, we show that HyperLlama models can effectively compress information from few-shot examples into soft prefixes. However, they still underperform multi-task fine-tuned language models with full attention over few-shot in-context examples. We also show that HyperLlama-generated soft prefixes can serve as better initializations for further prefix tuning. Overall, Gisting-based hypernetworks are economical and easy to implement, but have mixed empirical performance.
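
The modified attention mask mentioned in the abstract can be illustrated with a minimal sketch. It assumes a decoder-only causal layout in which the few-shot tokens come first, followed by a contiguous block of gist tokens, then the query. The function name, the `gist_start`/`gist_end` parameters, and the layout itself are illustrative assumptions, not HyperLlama's actual implementation.

```python
def gist_attention_mask(seq_len, gist_start, gist_end):
    """Build a seq_len x seq_len boolean mask.

    mask[i][j] is True when query position i may attend to key position j.
    Positions [gist_start, gist_end) hold the gist tokens (assumed layout).
    """
    if not 0 <= gist_start < gist_end <= seq_len:
        raise ValueError("expected 0 <= gist_start < gist_end <= seq_len")
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i  # standard causal constraint
            # Tokens after the gist block cannot see anything before it,
            # so the few-shot context must flow through the gist tokens.
            blocked = i >= gist_end and j < gist_start
            row.append(causal and not blocked)
        mask.append(row)
    return mask


# Hypothetical layout: 3 few-shot tokens, 2 gist tokens, 2 query tokens.
mask = gist_attention_mask(seq_len=7, gist_start=3, gist_end=5)
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

In the printed grid, the last two rows (the query tokens) have no access to the first three columns (the few-shot tokens), while the gist rows still see the full few-shot context. Under this assumption, the activations at the gist positions are the only channel through which few-shot information reaches the query, which is what lets them act as a compressed soft prefix.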

Paper Structure (33 sections, 4 equations, 9 figures)

Figures (9)

  • Figure 1: Overview of HyperTuning. (A) Fine-tuning, where all model parameters are updated (red). (B) Parameter-efficient fine-tuning (PEFT), where all model parameters are frozen (blue) and only a small number of parameters, $\phi$, are updated. (C) HyperTuning, where a hypermodel is used to generate parameters $\phi$ for a frozen downstream model. For instance, a hypermodel may take a set of few-shot examples to determine what $\phi$ to generate. Only the hypermodel's parameters are updated during training. (D) At inference time, the parameters $\phi$ need to be generated only once; thereafter only $\phi$ needs to be stored, with no need to retain the few-shot examples.
  • Figure 2: Overview of HyperT5. (A) HyperT5 takes as input few-shot examples and outputs PEFT parameters $\phi$. The model is initialized from an LM-adapted T5. (B) In HyperT5-Prefix, $\phi$ are key and value prefixes for every attention layer. (C) In HyperT5-LoRA, $\phi$ are additive low-rank modifications to the query and value linear maps.
  • Figure 3: Overview of HyperPretraining using the Context-Augmented Conditional Language Modeling (CACLM) objective to train a hypermodel to predict PEFT parameters $\phi$. (A) Sample a sequence of 512 tokens from a pretraining corpus, and splice into 4 segments A--D. (B) The frozen downstream model takes as input B and predicts continuation C. (C) The hypermodel is trained to encode additional context A and D into PEFT parameters $\phi$, providing additional information to the downstream model to predict C.
  • Figure 4: Performance of HyperT5 models on P3 evaluation with different amounts of hyperpretraining. Hyperpretraining is crucial for good performance of the hypermodels. However, hyperpretraining for too many steps can also hurt performance (as seen in the case of HyperT5-LoRA).
  • Figure 5: Average performance on P3 held-out tasks with prefix tuning and LoRA, using different parameter initializations. Using hypermodel-generated initializations starts with higher performance and continues to perform better on average over the course of training.
  • ...and 4 more figures
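
The CACLM data construction described in the Figure 3 caption can be sketched as follows. The caption specifies a 512-token window divided into four segments A-D, but not how the boundaries are chosen. The uniformly random cut points, the function name `caclm_split`, and the dictionary keys below are assumptions made for illustration only.

```python
import random


def caclm_split(tokens, rng, window=512):
    """Cut one training example for a CACLM-style objective.

    Returns the context for the hypermodel (A and D), the input to the
    frozen downstream model (B), and the continuation it must predict (C).
    """
    if len(tokens) < window:
        raise ValueError("corpus is shorter than the sampling window")
    start = rng.randrange(len(tokens) - window + 1)
    seq = tokens[start:start + window]
    # Assumption: three distinct interior cut points chosen uniformly,
    # which guarantees four non-empty segments.
    c1, c2, c3 = sorted(rng.sample(range(1, window), 3))
    a, b, c, d = seq[:c1], seq[c1:c2], seq[c2:c3], seq[c3:]
    return {
        "hyper_input": (a, d),    # extra context encoded into phi
        "downstream_input": b,    # what the frozen model sees
        "target": c,              # continuation to predict
    }


# Stand-in corpus of integer token ids (hypothetical data).
corpus = list(range(1000))
example = caclm_split(corpus, random.Random(0))
a, d = example["hyper_input"]
print(len(a), len(example["downstream_input"]), len(example["target"]), len(d))
```

Because the downstream model only ever sees B, any information from A and D that helps predict C has to arrive through the generated parameters $\phi$, which is the training signal for the hypermodel.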