Call for Rigor in Reporting Quality of Instruction Tuning Data

Hyeonseok Moon, Jaehyung Seo, Heuiseok Lim

TL;DR

The paper examines how the perceived quality of instruction-tuning data is confounded by arbitrary hyperparameter choices during model training. It uses two 1K general-domain datasets, LIMA and Alpaca-Longest, with Llama-2-7B (and Mistral-7B in the appendix) to show that data-quality conclusions depend on the chosen hyperparameters. The authors propose a local hyperparameter pool and recommend reporting the best-performing settings within that pool to assess data quality reliably, while acknowledging the additional computational cost. The work highlights a practical issue for LLM alignment benchmarking and motivates more rigorous, standardized validation practices so that conclusions stay stable across studies.

Abstract

Instruction tuning is crucial for adapting large language models (LLMs) to align with user intentions. Numerous studies emphasize the significance of the quality of instruction tuning (IT) data, revealing a strong correlation between IT data quality and the alignment performance of LLMs. In these studies, the quality of IT data is typically assessed by evaluating the performance of LLMs trained with that data. However, we identify a prevalent issue in this practice: hyperparameters for training models are often selected arbitrarily without adequate justification. We observe significant variation in the hyperparameters applied across different studies, even when training the same model with the same data. In this study, we demonstrate the potential problems arising from this practice and emphasize the need for careful consideration in verifying data quality. Through our experiments on the quality of LIMA data and a selected set of 1,000 Alpaca data points, we demonstrate that arbitrary hyperparameter decisions can lead to arbitrary conclusions.
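
The reporting practice summarized in the TL;DR (compare datasets by their best-performing setting within a shared hyperparameter pool, rather than at one arbitrarily chosen setting) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the `(learning_rate, epochs, batch_size)` layout of a setting, the dataset labels `A` and `B`, and every score are hypothetical placeholders and are not results from the paper.

```python
from typing import Dict, Tuple

# Hypothetical layout of one hyperparameter setting:
# (learning_rate, epochs, batch_size).
Setting = Tuple[float, int, int]


def best_in_pool(scores: Dict[Setting, float]) -> Tuple[Setting, float]:
    """Return the best-scoring setting in the pool together with its score."""
    setting = max(scores, key=scores.get)
    return setting, scores[setting]


def compare_at(setting: Setting,
               scores_a: Dict[Setting, float],
               scores_b: Dict[Setting, float]) -> str:
    """Which dataset looks better when both are trained at one fixed setting."""
    a, b = scores_a[setting], scores_b[setting]
    if a == b:
        return "tie"
    return "A" if a > b else "B"


def compare_best(scores_a: Dict[Setting, float],
                 scores_b: Dict[Setting, float]) -> str:
    """Which dataset looks better when each uses its own best setting."""
    _, a = best_in_pool(scores_a)
    _, b = best_in_pool(scores_b)
    if a == b:
        return "tie"
    return "A" if a > b else "B"


# Placeholder scores (made up for illustration, NOT from the paper).
# Each dict maps a setting in the shared pool to an evaluation score.
scores_a = {(1e-5, 3, 64): 0.40, (2e-5, 3, 64): 0.55, (1e-5, 15, 64): 0.62}
scores_b = {(1e-5, 3, 64): 0.48, (2e-5, 3, 64): 0.50, (1e-5, 15, 64): 0.58}

# A single fixed setting can point either way, depending on which one is picked.
for setting in scores_a:
    print(setting, "->", compare_at(setting, scores_a, scores_b))

# Comparing best-in-pool scores gives one conclusion per pool.
print("best-in-pool ->", compare_best(scores_a, scores_b))
```

With these placeholder numbers, the first setting favors dataset B while the other two favor dataset A, which mirrors the concern that a single arbitrary setting can decide the outcome. The best-in-pool comparison requires training every setting in the pool for every dataset, which is the added computational cost the TL;DR mentions.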

Paper Structure

This paper contains 20 sections, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Performance comparison between the two models trained with LIMA and Alpaca-Longest. We train the Llama-2-7B model with each dataset and evaluate data quality from the resulting models. The Y-axis represents the hyperparameter settings used in each experiment. Settings in bold yielded conclusive results consistently across all three evaluation datasets.
  • Figure 2: Performance comparison between the two models trained with LIMA and Alpaca-Longest. We train the Mistral-7B model with each dataset and evaluate data quality from the resulting models. The Y-axis represents the hyperparameter settings used in each experiment. Settings in bold yielded conclusive results consistently across all three evaluation datasets.