Revealing the structure of language model capabilities

Ryan Burnell; Han Hao; Andrew R. A. Conway; Jose Hernandez Orallo

TL;DR

This paper shows that LLM capabilities are multidimensional rather than monolithic, based on the performance of 29 models on the HELM benchmark. The authors apply both frequentist and Bayesian factor analysis to extract three latent abilities (comprehension, language modeling, and reasoning) and show that these abilities explain substantial variance and relate differently to model size and instruction tuning. The results suggest a consistent structure across models and imply that benchmark design could be streamlined to target these three abilities. The work also highlights data sharing and calls for replication with larger samples.

Abstract

Building a theoretical understanding of the capabilities of large language models (LLMs) is vital for our ability to predict and explain the behavior of these systems. Here, we investigate the structure of LLM capabilities by extracting latent capabilities from patterns of individual differences across a varied population of LLMs. Using a combination of Bayesian and frequentist factor analysis, we analyzed data from 29 different LLMs across 27 cognitive tasks. We found evidence that LLM capabilities are not monolithic. Instead, they are better explained by three well-delineated factors that represent reasoning, comprehension and core language modeling. Moreover, we found that these three factors can explain a high proportion of the variance in model performance. These results reveal a consistent structure in the capabilities of different LLMs and demonstrate the multifaceted nature of these capabilities. We also found that the three abilities show different relationships to model properties such as model size and instruction tuning. These patterns help refine our understanding of scaling laws and indicate that changes to a model that improve one ability might simultaneously impair others. Based on these findings, we suggest that benchmarks could be streamlined by focusing on tasks that tap into each broad model ability.
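To make the idea of extracting latent capabilities from individual differences more concrete, here is a minimal sketch on synthetic data. It is not the authors' pipeline: it uses a plain eigendecomposition of the task correlation matrix (a PCA-style stand-in for factor analysis), and the function names, loading ranges, noise level, and simulated scores are all assumptions made for illustration. Only the shape of the data (29 models, 27 tasks, 3 factors) follows the abstract.

```python
import numpy as np


def simulate_scores(n_models=29, n_tasks=27, n_factors=3, noise=0.3, seed=0):
    """Simulate task scores driven by a few latent abilities (hypothetical data)."""
    rng = np.random.default_rng(seed)
    # Each simulated model has one latent ability value per factor.
    abilities = rng.normal(size=(n_models, n_factors))
    # Each task loads on exactly one factor (simple structure, assumed here).
    loadings = np.zeros((n_tasks, n_factors))
    for task in range(n_tasks):
        loadings[task, task % n_factors] = rng.uniform(0.7, 1.0)
    # Observed score = ability times loading, plus task-specific noise.
    scores = abilities @ loadings.T + noise * rng.normal(size=(n_models, n_tasks))
    return scores, loadings


def variance_explained(scores, k):
    """Share of total variance captured by the k largest eigenvalues
    of the task-by-task correlation matrix."""
    corr = np.corrcoef(scores, rowvar=False)   # columns are tasks
    eigvals = np.linalg.eigvalsh(corr)[::-1]   # sorted largest first
    return eigvals[:k].sum() / eigvals.sum()


scores, true_loadings = simulate_scores()
share = variance_explained(scores, k=3)
print(f"Share of variance captured by 3 components: {share:.2f}")
```

Because the simulated tasks are generated from three abilities, three components capture most of the variance here by construction. With real benchmark data, a dedicated factor analysis with rotation would be needed to obtain interpretable loadings.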

Paper Structure (14 sections, 6 figures, 3 tables)

Introduction
Methods
HELM data
Task selection
Task demand coding
Results
Task correlations
Factor analysis
Bayesian factor analysis
Model factor scores
Relationships with model characteristics
Discussion
Correlations between tasks
Determining Factor Structure

Figures (6)

Figure 1: Task annotations and factor loadings for each task from the Bayesian factor analysis (left) and the frequentist factor analysis (right). Darker greens represent stronger positive loadings; darker reds represent stronger negative loadings. Note that the Bayesian method proposed by Conti et al. (2014) only calculates each task's loading on its assigned factor.
Figure 2: Factor scores for each model on the three factors based on the frequentist analysis. Darker greens represent higher factor scores (greater levels of ability), while darker reds represent lower factor scores (lower levels of ability). Models are sorted by reasoning score.
Figure 3: Plots of the relationships between log model size (billions) and the extracted factor scores for each factor. Lines represent the linear relationship between the variables with 95% confidence bands.
Figure 4: Scree plot of eigenvalues associated with each factor. The red line represents the standard cutoff (eigenvalue = 1).
Figure 5: Hull method plot of goodness of fit (f; calculated as 1 - RMSEA), against model degrees of freedom.
...and 1 more figure
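
The eigenvalue cutoff shown in the scree plot (Figure 4) can be illustrated with a short sketch. The correlation matrix below is a hypothetical block-structured example, not the paper's data, and `kaiser_count` is an illustrative helper name. It counts how many eigenvalues of a correlation matrix exceed 1, which is the cutoff marked by the red line in the caption.

```python
import numpy as np


def kaiser_count(corr, cutoff=1.0):
    """Number of eigenvalues of a correlation matrix above the cutoff."""
    eigvals = np.linalg.eigvalsh(corr)
    return int(np.sum(eigvals > cutoff))


# Hypothetical correlation matrix: 3 clusters of 3 tasks each.
# Tasks correlate 0.6 within a cluster and 0 across clusters.
block = 0.6 * np.ones((3, 3)) + 0.4 * np.eye(3)
corr = np.kron(np.eye(3), block)

# Eigenvalues sorted largest first, as they would appear on a scree plot.
eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]
n_factors = kaiser_count(corr)
print("eigenvalues:", np.round(eigvals, 2))
print("factors retained at cutoff 1:", n_factors)
```

Each block contributes one eigenvalue of 1 + 2(0.6) = 2.2 and two of 0.4, so three eigenvalues clear the cutoff in this toy case. On real data the cutoff is only one heuristic, which is presumably why the paper also reports a Hull method plot (Figure 5).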
