Toward the Evaluation of Large Language Models Considering Score Variance across Instruction Templates

Yusuke Sakai, Adam Nohejl, Jiangnan Hang, Hidetaka Kamigaito, Taro Watanabe

TL;DR

This work tackles the problem of prompt-induced score variance in the NLU evaluation of large language models by constructing English-Japanese cross-lingual benchmarks with multiple instruction templates and regex-constrained outputs. It introduces the Sharpe score, a variance-aware metric that balances mean performance against variance across templates, to enable fairer comparisons across models and settings. Through zero-shot and fine-tuning experiments across language pairs and decoding regimes, the study shows that template variance can significantly affect model rankings and performance, and that constrained decoding generally aids zero-shot evaluation while greedy decoding benefits fine-tuning. The findings highlight the importance of evaluating LLMs with diverse templates and robust formatting constraints to accurately gauge generalization and cross-lingual transfer.

Abstract

The natural language understanding (NLU) performance of large language models (LLMs) has been evaluated across various tasks and datasets. The existing evaluation methods, however, do not take into account the variance in scores due to differences in prompts, which leads to unfair evaluation and comparison of NLU performance. Moreover, evaluation designed for specific prompts is inappropriate for instruction tuning, which aims to perform well with any prompt. It is therefore necessary to find a way to measure NLU performance in a fair manner, considering score variance between different instruction templates. In this study, we provide English and Japanese cross-lingual datasets for evaluating the NLU performance of LLMs, which include multiple instruction templates for fair evaluation of each task, along with regular expressions to constrain the output format. Furthermore, we propose the Sharpe score as an evaluation metric that takes into account the variance in scores between templates. Comprehensive analysis of English and Japanese LLMs reveals that the high variance among templates has a significant impact on the fair evaluation of LLMs.
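To make the idea of a variance-aware metric concrete, the following is a minimal sketch and not the authors' implementation. It assumes the Sharpe score takes the form mean / (alpha * std + 1) over per-template scores; this section does not state the formula, so that form, the function name `sharpe_score`, the use of the population standard deviation, and the 0-1 score scale are all assumptions. The per-template accuracies are hypothetical and are not results from the paper.

```python
from statistics import mean, pstdev


def sharpe_score(scores, alpha=1.0):
    """Mean per-template score, discounted by the spread across templates.

    Assumed form: mean / (alpha * std + 1). With alpha = 0 this reduces
    to the plain mean; larger alpha penalizes template variance more.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    return mean(scores) / (alpha * pstdev(scores) + 1.0)


# Hypothetical per-template accuracies (illustrative only).
model_a = [0.80, 0.80, 0.80, 0.80]  # lower mean, identical across templates
model_b = [0.95, 0.70, 0.95, 0.70]  # higher mean, large spread across templates

for alpha in (0.0, 1.0, 2.0):
    a = sharpe_score(model_a, alpha)
    b = sharpe_score(model_b, alpha)
    better = "A" if a > b else "B"
    print(f"alpha={alpha:.1f}  A={a:.3f}  B={b:.3f}  better={better}")
```

Under these assumptions, model B ranks first by mean alone (alpha = 0), but model A overtakes it once the variance penalty is applied. This is the kind of ranking change that sweeping alpha is meant to expose.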

Paper Structure

This paper contains 39 sections, 2 equations, 7 figures, and 9 tables.

Figures (7)

  • Figure 1: Examples of the dataset creation process for the MNLI task. We modified the original FLAN templates for evaluation, as highlighted in green. A regular expression (RE), shown in the purple area, is attached to the expected answer format. We translated this template to create the Japanese templates described in Appendix \ref{sec:japanese-instruction-template}.
  • Figure 2: Evaluation results for each template when trained with only a single template. The results show the evaluation for each template after training using only the template with ID 0-0 (positioned at the top in the figure). The first part of the template number indicates the type of template, and the second part indicates the type of answer format. The types of answer formats are described in Figure \ref{fig:jnli-example}. The LLMs used for evaluation are StableLM-ja-7B, StableLM-ja-7B-inst, ELYZA-Llama-2-7B, and ELYZA-Llama-2-7B-inst.
  • Figure 3: Changes in the rankings of each model when the Sharpe score parameter $\alpha$ is varied from 0 to 2 in increments of 0.1 in the fine-tuning setting on the Japanese dataset. The vertical axis represents the ranking of each model, and the horizontal axis represents $\alpha$. The more often the lines intersect, the greater the variance among the templates, which suggests that model rankings change frequently as the parameter varies.
  • Figure 4: Dataset sources and number of each instruction template.
  • Figure 5: Changes in the rankings of each model when the Sharpe score parameter $\alpha$ is varied from 0 to 2 in increments of 0.1 in the fine-tuning setting on the English dataset.
  • ...and 2 more figures