Investigating the Effectiveness of HyperTuning via Gisting
Jason Phang
TL;DR
This work investigates Gisting (Mu et al., 2023) as an economical way to train Transformer-based hypernetworks. The authors introduce HyperLlama, a set of Gisting-based hypernetworks built on Llama-2 models that generate task-specific soft prefixes from few-shot inputs. Across P3, Super-NaturalInstructions, and Symbol Tuning datasets, HyperLlama models compress information from few-shot examples into soft prefixes, and those prefixes serve as better initializations for further prefix tuning. They still underperform multi-task fine-tuned language models with full attention over the few-shot in-context examples, so the empirical picture is mixed.
Abstract
Gisting (Mu et al., 2023) is a simple method for training models to compress information into fewer token representations using a modified attention mask, and can serve as an economical approach to training Transformer-based hypernetworks. We introduce HyperLlama, a set of Gisting-based hypernetworks built on Llama-2 models that generates task-specific soft prefixes based on few-shot inputs. In experiments across P3, Super-NaturalInstructions and Symbol Tuning datasets, we show that HyperLlama models can effectively compress information from few-shot examples into soft prefixes. However, they still underperform multi-task fine-tuned language models with full attention over few-shot in-context examples. We also show that HyperLlama-generated soft prefixes can serve as better initializations for further prefix tuning. Overall, Gisting-based hypernetworks are economical and easy to implement, but have mixed empirical performance.
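To make the "modified attention mask" concrete, here is a minimal sketch of a Gisting-style mask for a decoder-only model. It assumes the sequence is laid out as content to compress (for example, few-shot examples), then gist tokens, then the downstream input, and that positions after the gist tokens are blocked from attending to anything before them. The function name `gist_attention_mask` and its parameters are hypothetical and chosen for illustration; this is not the HyperLlama implementation.

```python
def gist_attention_mask(seq_len, gist_start, gist_end):
    """Build a boolean mask where mask[i][j] is True if query i may attend to key j.

    Assumed layout (illustrative only):
      [0, gist_start)        content to compress, e.g. few-shot examples
      [gist_start, gist_end) gist tokens
      [gist_end, seq_len)    downstream input
    """
    if not 0 <= gist_start <= gist_end <= seq_len:
        raise ValueError("expected 0 <= gist_start <= gist_end <= seq_len")

    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            # Standard causal constraint: a position sees itself and the past.
            causal = j <= i
            # Gisting constraint: positions after the gist tokens cannot see
            # the content that precedes the gist tokens.
            blocked = i >= gist_end and j < gist_start
            row.append(causal and not blocked)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Two content tokens, two gist tokens, two downstream tokens.
    for row in gist_attention_mask(seq_len=6, gist_start=2, gist_end=4):
        print("".join("x" if allowed else "." for allowed in row))
```

Under this mask the downstream positions can reach the few-shot content only through the gist positions, so whatever they need must be compressed into the gist-token representations. Those representations are what a Gisting-based hypernetwork can then emit as a soft prefix.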
