Stay Tuned: An Empirical Study of the Impact of Hyperparameters on LLM Tuning in Real-World Applications

Alon Halfon, Shai Gretz, Ofir Arviv, Artem Spector, Orith Toledo-Ronen, Yoav Katz, Liat Ein-Dor, Michal Shmueli-Scheuer, Noam Slonim

TL;DR

The paper addresses the practical challenge of hyperparameter optimization for fine-tuning LLMs in real-world settings. It introduces Coverage-based Search (CBS), a grid-search-driven approach that ranks HP configurations by cross-dataset coverage to yield robust recommendations across tasks and domains. Across two models (Llama-3-8B and Mistral-7B-v0.3) and two tuning methods (FFT and LoRA), CBS_1 consistently outperforms public defaults and approaches the performance of an upper-bound per-dataset grid search, with LoRA often preferred in small-data regimes. The work delivers concrete HP recommendations, demonstrates substantial practical gains with limited search budgets, and offers a scalable framework for extending HP guidance to additional models and tuning strategies, thereby helping practitioners save compute while achieving strong fine-tuning performance.

Abstract

Fine-tuning Large Language Models (LLMs) is an effective method to enhance their performance on downstream tasks. However, choosing the appropriate setting of tuning hyperparameters (HPs) is a labor-intensive and computationally expensive process. Here, we provide recommended HP configurations for practical use-cases that represent a better starting point for practitioners, when considering two SOTA LLMs and two commonly used tuning methods. We describe Coverage-based Search (CBS), a process for ranking HP configurations based on an extensive offline grid search, such that the top-ranked configurations collectively provide a practical, robust recommendation for a wide range of datasets and domains. We focus our experiments on Llama-3-8B and Mistral-7B, as well as full fine-tuning and LoRA, conducting a total of more than 10,000 tuning experiments. Our results suggest that, in general, Llama-3-8B and LoRA should be preferred when possible. Moreover, we show that for both models and tuning methods, exploring only a few HP configurations, as recommended by our analysis, can provide excellent results in practice, making this work a valuable resource for practitioners.
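To make the idea of ranking configurations by cross-dataset coverage more concrete, here is a minimal Python sketch of one possible reading of it. It is not the authors' implementation. The greedy selection rule, the `tolerance` threshold that decides when a configuration "covers" a dataset, the tie-breaking by mean score, and all names (`coverage_rank`, `cfg_a`, `d1`, and so on) are assumptions made for illustration, and the scores are made-up toy values rather than results from the paper.

```python
from statistics import mean
from typing import Dict, List


def coverage_rank(scores: Dict[str, Dict[str, float]],
                  tolerance: float = 0.03) -> List[str]:
    """Greedily rank configurations by how many datasets they newly cover.

    scores[config][dataset] is a grid-search metric (higher is better).
    ASSUMPTION: a config "covers" a dataset when its score is within
    `tolerance` of the best score any config achieved on that dataset.
    """
    datasets = sorted({d for per_ds in scores.values() for d in per_ds})
    missing = float("-inf")
    # Best score per dataset across all configurations in the grid.
    best = {d: max(s.get(d, missing) for s in scores.values()) for d in datasets}
    # Set of datasets each configuration covers under the assumed threshold.
    covers = {
        c: {d for d in datasets if s.get(d, missing) >= best[d] - tolerance}
        for c, s in scores.items()
    }

    ranked: List[str] = []
    remaining = set(scores)
    uncovered = set(datasets)
    while remaining:
        if not uncovered:
            # Everything is covered: start a new round so that the
            # remaining configurations still receive a rank.
            uncovered = set(datasets)
        # Pick the config covering the most uncovered datasets;
        # break ties by mean score, then by name (via the sorted order).
        pick = max(
            sorted(remaining),
            key=lambda c: (len(covers[c] & uncovered), mean(scores[c].values())),
        )
        ranked.append(pick)
        remaining.remove(pick)
        uncovered -= covers[pick]
    return ranked


# Toy, made-up scores for three hypothetical configs on three datasets.
toy_scores = {
    "cfg_a": {"d1": 0.90, "d2": 0.80, "d3": 0.50},
    "cfg_b": {"d1": 0.60, "d2": 0.79, "d3": 0.85},
    "cfg_c": {"d1": 0.89, "d2": 0.60, "d3": 0.55},
}
print(coverage_rank(toy_scores))
```

Under this reading, the first few entries of the returned list are the configurations one would try first when the search budget is limited, since together they come close to the best grid result on as many datasets as possible.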

Paper Structure (22 sections, 6 equations, 2 figures, 5 tables)

Figures (2)

  • Figure 1: Cartesian product defining our experimental setup. The evaluation is performed across two models, two tuning methods, three tasks with multiple datasets, two training set sizes, and over multiple HPs.
  • Figure 2: Effect of increasing HP configuration budget. The y-axis denotes macro-averaged scores over all datasets and training sizes, normalized w.r.t. the upper-bound score obtained by the respective model and tuning method (i.e., $s_n(c)$).
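
The Cartesian product described for Figure 1 can be sketched in Python with `itertools.product`. The model and tuning-method names come from the summary above. The task and dataset names, the training-size labels, and the HP names and values are hypothetical placeholders, not the paper's actual grid.

```python
from itertools import product
from typing import Dict, Iterator, List

MODELS = ["Llama-3-8B", "Mistral-7B-v0.3"]
TUNING_METHODS = ["FFT", "LoRA"]

# HYPOTHETICAL placeholders: three tasks, each with one or more datasets.
DATASETS: Dict[str, List[str]] = {
    "task_1": ["ds_a", "ds_b"],
    "task_2": ["ds_c"],
    "task_3": ["ds_d"],
}
# HYPOTHETICAL labels for the two training set sizes.
TRAIN_SIZES = ["small", "large"]
# HYPOTHETICAL hyperparameter grid.
HP_GRID = {
    "learning_rate": [1e-5, 1e-4],
    "batch_size": [8, 32],
}


def enumerate_runs() -> Iterator[dict]:
    """Yield one dict per tuning experiment in the full Cartesian product."""
    hp_names = list(HP_GRID)
    # Every combination of HP values becomes one HP configuration.
    hp_configs = [
        dict(zip(hp_names, values)) for values in product(*HP_GRID.values())
    ]
    # Flatten tasks into (task, dataset) pairs.
    task_datasets = [(t, d) for t, ds in DATASETS.items() for d in ds]
    for model, method, (task, dataset), size, hp in product(
        MODELS, TUNING_METHODS, task_datasets, TRAIN_SIZES, hp_configs
    ):
        yield {
            "model": model,
            "method": method,
            "task": task,
            "dataset": dataset,
            "train_size": size,
            "hp": hp,
        }


runs = list(enumerate_runs())
print(len(runs))
```

The number of runs is the product of the sizes of all axes, which is why the total grows quickly as datasets or HP values are added.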