Comprehensive Reassessment of Large-Scale Evaluation Outcomes in LLMs: A Multifaceted Statistical Approach

Kun Sun; Rong Wang; Anders Søgaard

TL;DR

This study re-examines large-scale LLM evaluation results, targeting the inadequacies of current evaluation methods, and introduces a comprehensive statistical methodology that offers a robust and transparent approach to interpreting LLM performance data.

Abstract

As LLMs evolve rapidly, evaluation has become central to understanding and advancing these models. Evaluations have shown that factors such as scale, training type, and architecture profoundly affect LLM performance. However, the extent and nature of these effects remain debated, because most assessments have been restricted to a limited number of models and data points. The effects of these factors on performance scores can be clarified more effectively through a statistical lens. Our study re-examines LLM evaluation results, targeting the inadequacies of current evaluation methods. Building on a recently available uniform evaluation framework, our research draws on an expansive dataset of evaluation results and introduces a comprehensive statistical methodology. This methodology applies ANOVA, Tukey HSD tests, GAMM, and a clustering technique, offering a robust and transparent approach to interpreting LLM performance data. Contrary to prevailing findings, our results challenge assumptions about emergent abilities and about the influence of particular training types and architectures in LLMs. These findings offer new perspectives on the characteristics, intrinsic nature, and developmental trajectories of LLMs. By providing straightforward and reliable methods to scrutinize and reassess LLM performance data, this study contributes a nuanced perspective on LLM efficiency and potential.
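To make the first of the named methods concrete, the following is a minimal sketch of a one-way ANOVA F-statistic, the quantity that tests whether mean scores differ across groups such as parameter ranges. It is an illustration only, not the authors' code: the function name `one_way_anova_f`, the group labels, and the scores are all hypothetical, and a real analysis would normally use a statistics package and follow a significant result with a Tukey HSD test.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of lists, one list of scores per group.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical benchmark scores grouped by parameter range (made-up numbers).
scores_by_size = {
    "small": [30.0, 32.0, 31.0],
    "medium": [40.0, 42.0, 41.0],
    "large": [50.0, 52.0, 51.0],
}
f_stat, df_b, df_w = one_way_anova_f(list(scores_by_size.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.1f}")
```

A large F relative to the F distribution with the reported degrees of freedom indicates that at least one group mean differs; it does not say which pairs differ, which is the role of the post-hoc Tukey HSD test.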

Paper Structure (14 sections, 12 figures, 2 tables)

Introduction
Methods
The datasets
Re-evaluation methods
Results
Result 1: Difference analysis by parameter, and training type / architecture
Result 2: GAMM analysis: Emergent abilities and interplay of various abilities
Clusters with key factors
Discussion
Conclusion
ANOVA, Tukey Tests and Correlations
GAMMs and the Partial Effects on Scaled Data
t-SNE Clusters
Results of the supplementary dataset

Figures (12)

Figure 1: Reassessment methods for massive LLMs evaluation outcomes
Figure 2: Emergent abilities of LLMs are created by the chosen metrics (Schaeffer et al., 2023), not by unpredictable changes in model behavior with scale.
Figure 3: Partial effects of parameters on LLM performance scores. (In each plot, the vertical lines mark the quintile boundaries, which divide the parameter data into five equal-sized groups: 0%[0.01B] - 20%[1.31B] - 40%[6.53B] - 60%[6.65B] - 80%[12.85B] - 100%[180B]. The x-axis "log_Param" is the logarithm of the model parameter count, and the y-axis is the logarithm of the scores on each evaluation dataset. A partial-effect curve represents the relationship between a predictor variable and the response variable. Steeper slopes suggest a stronger relationship, and flatter slopes imply a weaker one. When a curve fluctuates around zero, the effect is weak. The pointwise 95% confidence intervals are shown by the blue shading.)
Figure 4: Correlations among various benchmark datasets
Figure 5: Partial effects of one given ability on other abilities in LLMs. The x-axis represents the logarithmic scale of a specific ability, while the y-axis corresponds to the logarithmic scale of various other abilities. The slope of the curve provides insights into the strength of this relationship: a steeper slope indicates a more pronounced effect, whereas a gentler slope suggests a more subdued impact. Notably, when the value of p is less than 0.01, the curve tends to level off near zero. This phenomenon signifies that the ability in question has little to no influence on the other ability.
...and 7 more figures
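The quintile boundaries described in the Figure 3 caption can be sketched as follows. This is a hedged illustration, not the authors' procedure: the function name `log_param_quintile_edges` and the parameter counts are hypothetical, and the natural logarithm and the "inclusive" quantile method are assumptions, since the caption specifies neither the log base nor the interpolation rule.

```python
import math
from statistics import quantiles


def log_param_quintile_edges(param_counts_b):
    """Return the six edges (0%, 20%, ..., 100%) of log parameter counts.

    `param_counts_b` holds parameter counts in billions. The natural log
    is an assumption; the caption does not state the base.
    """
    logs = sorted(math.log(p) for p in param_counts_b)
    # Four interior cut points that split the data into five groups.
    inner = quantiles(logs, n=5, method="inclusive")
    return [logs[0], *inner, logs[-1]]


# Hypothetical parameter counts in billions (made-up, not the paper's data).
params_b = [0.01, 1.0, 7.0, 13.0, 70.0, 180.0]
edges = log_param_quintile_edges(params_b)
for pct, edge in zip(range(0, 101, 20), edges):
    print(f"{pct:3d}%  log_Param = {edge:+.3f}  ({math.exp(edge):.2f}B)")
```

Because the cut points are computed from the data rather than from fixed sizes, neighbouring edges can sit very close together when many models share a similar size, as with the 6.53B and 6.65B boundaries in the caption.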
