Toward the Evaluation of Large Language Models Considering Score Variance across Instruction Templates
Yusuke Sakai, Adam Nohejl, Jiangnan Hang, Hidetaka Kamigaito, Taro Watanabe
TL;DR
This work addresses prompt-induced score variance in the NLU evaluation of large language models by constructing English-Japanese cross-lingual benchmarks with multiple instruction templates and regex-constrained outputs. It introduces the Sharpe score, a variance-aware metric that balances mean performance against the variance across templates, to enable fairer comparisons between models and settings. Extensive zero-shot and fine-tuning experiments across language pairs and decoding regimes show that template variance can significantly affect model performance and rankings, and that constrained decoding generally helps zero-shot evaluation while greedy decoding benefits fine-tuned models. The findings highlight the importance of evaluating LLMs with diverse templates and robust output-format constraints in order to gauge generalization and cross-lingual transfer accurately.
Abstract
The natural language understanding (NLU) performance of large language models (LLMs) has been evaluated across various tasks and datasets. The existing evaluation methods, however, do not take into account the variance in scores due to differences in prompts, which leads to unfair evaluation and comparison of NLU performance. Moreover, evaluation designed for specific prompts is inappropriate for instruction tuning, which aims to perform well with any prompt. It is therefore necessary to find a way to measure NLU performance in a fair manner, considering score variance between different instruction templates. In this study, we provide English and Japanese cross-lingual datasets for evaluating the NLU performance of LLMs, which include multiple instruction templates for fair evaluation of each task, along with regular expressions to constrain the output format. Furthermore, we propose the Sharpe score as an evaluation metric that takes into account the variance in scores between templates. Comprehensive analysis of English and Japanese LLMs reveals that the high variance among templates has a significant impact on the fair evaluation of LLMs.
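As a rough illustration of what a variance-aware metric of this kind might look like, the sketch below penalizes the mean score across templates by its standard deviation. The abstract does not state the formula, so the form used here, mean / (alpha * std + 1), the weight `alpha`, the function name `sharpe_like_score`, and the example scores are all assumptions made for illustration rather than the authors' definition.

```python
from statistics import mean, pstdev


def sharpe_like_score(template_scores, alpha=1.0):
    """Variance-penalized mean over per-template scores.

    Assumed form (not taken from the abstract): mean / (alpha * std + 1).
    `alpha` is a hypothetical weight on the variance penalty.
    """
    if not template_scores:
        raise ValueError("need at least one per-template score")
    mu = mean(template_scores)
    sigma = pstdev(template_scores)  # population standard deviation
    return mu / (alpha * sigma + 1.0)


# Two hypothetical models with the same mean accuracy over four templates.
stable = [0.70, 0.70, 0.70, 0.70]
volatile = [0.95, 0.45, 0.90, 0.50]

stable_score = sharpe_like_score(stable)
volatile_score = sharpe_like_score(volatile)
```

Under these assumptions, the two models tie on mean accuracy, but the model whose score swings between templates receives the lower value, which is the behavior the abstract asks of a fair metric. Note that the size of the penalty in this sketch depends on the score scale (0 to 1 here versus percentages).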
