Crafting Efficient Fine-Tuning Strategies for Large Language Models

Michael Oliver, Guan Wang

TL;DR

This work tackles data efficiency and hyperparameter optimization in fine-tuning large language models for domain-specific information extraction from e-commerce web pages. It demonstrates that very small datasets (around 200 samples) can substantially boost accuracy and identifies a data-saturation point near 6,500 samples. The authors propose an early-performance-driven Bayesian optimization approach that evaluates models at 20% of training and shows strong predictive correlation with final performance, achieving about 2% additional accuracy over baselines on independent data. Collectively, the results offer practical, compute-conscious guidance for efficiently fine-tuning LLMs with LoRA while preserving high performance on real-world extraction tasks.

Abstract

This paper addresses the challenges of efficiently fine-tuning large language models (LLMs) by exploring data efficiency and hyperparameter optimization. We investigate the minimum data required for effective fine-tuning and propose a novel hyperparameter optimization method that leverages early-stage model performance. Our experiments demonstrate that fine-tuning with as few as 200 samples can improve model accuracy from 70% to 88% in a product attribute extraction task. We identify a saturation point of approximately 6,500 samples, beyond which additional data yields diminishing returns. Our proposed Bayesian hyperparameter optimization method evaluates models at 20% of total training time, and this early-stage performance correlates strongly with final model performance: 4 of the top 5 early-stage models remain in the top 5 at completion. This approach led to a 2% improvement in accuracy over baseline models when evaluated on an independent test set. These findings offer actionable insights for practitioners, potentially reducing computational load and dependency on extensive datasets while enhancing overall performance of fine-tuned LLMs.
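
The "4 out of 5" figure in the abstract is a top-k overlap between two rankings of the same models: one by accuracy at the early checkpoint, one by accuracy at completion. The sketch below shows one way such an overlap could be computed. The function name `top_k_overlap`, the model identifiers, and all accuracy values are hypothetical placeholders chosen for illustration; they are not the paper's code or results.

```python
def top_k_overlap(early_scores, final_scores, k=5):
    """Count models ranked in the top k both early and at completion."""
    def top_k(scores):
        # Rank model ids by accuracy, highest first; break ties by id
        # so the result is deterministic.
        ranked = sorted(scores, key=lambda m: (-scores[m], m))
        return set(ranked[:k])

    return len(top_k(early_scores) & top_k(final_scores))


# Placeholder accuracies (assumed values, NOT taken from the paper).
# "m4" mimics a failed run with 0 accuracy.
early = {"m1": 0.81, "m2": 0.79, "m3": 0.86, "m4": 0.00,
         "m5": 0.84, "m6": 0.83, "m7": 0.80, "m8": 0.78}
final = {"m1": 0.88, "m2": 0.90, "m3": 0.91, "m4": 0.00,
         "m5": 0.89, "m6": 0.885, "m7": 0.87, "m8": 0.86}

overlap = top_k_overlap(early, final, k=5)
print(f"{overlap} of the top 5 early-stage models stay in the top 5")
```

A high overlap suggests that ranking candidates at the early checkpoint is a usable proxy for ranking them at completion, which is the property an early-evaluation search relies on.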

Paper Structure

This paper contains 20 sections, 1 equation, 4 figures, 9 tables, 1 algorithm.

Figures (4)

  • Figure 1: Accuracy for different attributes and average accuracy, for different amounts of training data.
  • Figure 2: Accuracy score vs model number $n$. The accuracy of the models produced at $t_1$, $t_2$, and at the evaluation minima are shown together. In the early stages, where the study is in the exploration phase, more of the tested models have 0 accuracy. Only 3 out of 30 models failed in the exploitation phase.
  • Figure 3: Learning Rate vs LoRA $\alpha$ with resulting model accuracies. Models with high values for both parameters had significant fitting issues, producing useless models with 0 accuracy.
  • Figure 4: Validation Loss vs Epoch for a sample of models. Models for which the training failed significantly are excluded.