Transfer Learning for Finetuning Large Language Models

Tobias Strangmann, Lennart Purucker, Jörg K. H. Franke, Ivo Rapant, Fabio Ferreira, Frank Hutter

TL;DR

This work applies transfer learning to finetuning: it meta-learns performance and cost surrogate models for grey-box meta-optimization from a new meta-dataset and demonstrates the transferability of finetuning to adapt large language models more effectively.

Abstract

As the landscape of large language models expands, efficiently finetuning for specific tasks becomes increasingly crucial. At the same time, the landscape of parameter-efficient finetuning methods rapidly expands. Consequently, practitioners face a multitude of complex choices when searching for an optimal finetuning pipeline for large language models. To reduce the complexity for practitioners, we investigate transfer learning for finetuning large language models and aim to transfer knowledge about configurations from related finetuning tasks to a new task. In this work, we transfer learn finetuning by meta-learning performance and cost surrogate models for grey-box meta-optimization from a new meta-dataset. Counter-intuitively, we propose to rely only on transfer learning for new datasets. Thus, we do not use task-specific Bayesian optimization but prioritize knowledge transferred from related tasks over task-specific feedback. We evaluate our method on eight synthetic question-answer datasets and a meta-dataset consisting of 1,800 runs of finetuning Microsoft's Phi-3. Our transfer learning is superior to zero-shot, default finetuning, and meta-optimization baselines. Our results demonstrate the transferability of finetuning to adapt large language models more effectively.
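
To make the transfer-only idea concrete, the following is a minimal sketch of a cost-aware grey-box search whose surrogates stay frozen, so feedback from the new task never updates them. It is an illustration under stated assumptions, not the paper's implementation: the `(rank, learning rate)` configuration, the toy surrogates, the `train_one_step` callback, and the score-per-cost selection rule are all hypothetical stand-ins for the meta-learned models and acquisition function.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Config = Tuple[int, float]  # hypothetical pipeline description: (LoRA rank, learning rate)


@dataclass
class Run:
    config: Config
    steps_done: int = 0  # fidelity reached so far, e.g. finetuning epochs
    last_score: float = float("-inf")


def transfer_only_search(
    configs: List[Config],
    perf_surrogate: Callable[[Config, int], float],  # meta-learned, kept frozen
    cost_surrogate: Callable[[Config, int], float],  # meta-learned, kept frozen
    train_one_step: Callable[[Config, int], Tuple[float, float]],
    budget: float,
    max_steps: int,
) -> Tuple[Optional[Config], float, float]:
    runs = [Run(c) for c in configs]
    spent = 0.0
    while True:
        # Keep only runs that can be extended within the remaining budget,
        # judged by the predicted cost of their next step.
        candidates = [
            r for r in runs
            if r.steps_done < max_steps
            and spent + cost_surrogate(r.config, r.steps_done + 1) <= budget
        ]
        if not candidates:
            break
        # Assumed selection rule: predicted score per unit of predicted cost.
        chosen = max(
            candidates,
            key=lambda r: perf_surrogate(r.config, r.steps_done + 1)
            / cost_surrogate(r.config, r.steps_done + 1),
        )
        score, cost = train_one_step(chosen.config, chosen.steps_done + 1)
        chosen.steps_done += 1
        chosen.last_score = score
        spent += cost
        # The surrogates are deliberately NOT refit on (score, cost) here.
    started = [r for r in runs if r.steps_done > 0]
    if not started:
        return None, float("-inf"), spent
    best = max(started, key=lambda r: r.last_score)
    return best.config, best.last_score, spent


# Toy stand-ins for the meta-learned surrogates and the real finetuning step.
def toy_perf(config: Config, step: int) -> float:
    return 0.6 + 0.1 * config[0] / 64 + 0.05 * step


def toy_cost(config: Config, step: int) -> float:
    return float(config[0])  # pretend cost per step grows with the rank


def toy_train(config: Config, step: int) -> Tuple[float, float]:
    return 0.5 + 0.1 * config[0] / 64 + 0.02 * step, float(config[0])


CONFIGS = [(8, 1e-4), (16, 1e-4), (64, 1e-4)]
best_config, best_score, spent = transfer_only_search(
    CONFIGS, toy_perf, toy_cost, toy_train, budget=100.0, max_steps=3
)
```

In a task-specific Bayesian optimization loop, the observed score and cost would be used to refit the surrogates after every step. Here they only determine which configuration is returned at the end.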

Paper Structure

This paper contains 11 sections, 1 equation, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Method Overview. We generate new NLP datasets from scientific papers and then create a meta-dataset, which we use for transfer learning to finetune by pre-training Quick-Tune (left). For a new dataset, we compute meta-features and then apply the pre-trained Quick-Tune (right).
  • Figure 2: Our Meta-Dataset. For each run stored in our meta-dataset, represented by a blue circle, we present the accuracy and finetuning time in seconds.
  • Figure 3: Optimizer Performance Over Time. We visualize the average validation (left) and test (right) performance across the eight datasets over time. At each time point, we evaluate the best pipeline found so far. We observe that DEHB and Quick-Tune (default) stagnate after 1 to 1.5 hours, with little progress on test scores afterward. Quick-Tune (ours) only stagnates after 3 hours.
  • Figure 4: Final Performance. We show the validation (left) and test (right) learning curve of the best pipeline returned by the optimizers after 5 hours, averaged across eight datasets. The finetuning pipeline returned by Quick-Tune (ours) performs best.
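
The "best pipeline found so far" curve described for Figure 3 can be sketched as an incumbent trajectory. This is a minimal illustration, not the authors' evaluation code: the tuple layout, the function name, and the choice to select the incumbent by validation score and then report its test score are assumptions, and the numbers are made up.

```python
from typing import List, Optional, Tuple

# (finish_time_hours, validation_score, test_score); hypothetical record layout
Evaluation = Tuple[float, float, float]
Point = Tuple[float, Optional[float], Optional[float]]


def incumbent_trajectory(evals: List[Evaluation], time_grid: List[float]) -> List[Point]:
    """For each time in time_grid (ascending), report the scores of the
    evaluation with the best validation score finished by that time."""
    ordered = sorted(evals, key=lambda e: e[0])
    best_val: Optional[float] = None
    best_test: Optional[float] = None
    trajectory: List[Point] = []
    i = 0
    for t in time_grid:
        # Consume every evaluation that has finished by time t.
        while i < len(ordered) and ordered[i][0] <= t:
            _, val, test = ordered[i]
            if best_val is None or val > best_val:
                best_val, best_test = val, test
            i += 1
        trajectory.append((t, best_val, best_test))
    return trajectory


# Made-up evaluations for illustration only.
evals = [(0.5, 0.60, 0.58), (1.0, 0.70, 0.65), (2.0, 0.68, 0.69), (3.0, 0.75, 0.71)]
curve = incumbent_trajectory(evals, [0.25, 1.0, 2.5, 5.0])
```

Averaging such trajectories over datasets on a shared time grid gives one curve per optimizer. Because the incumbent is chosen by validation score, the test curve can stay flat, or even drop, while the validation curve rises.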