Language Complexity Measurement as a Noisy Zero-Shot Proxy for Evaluating LLM Performance
Birger Moell, Johan Boye
TL;DR
The paper asks whether language-complexity metrics can serve as noisy zero-shot proxies for LLM performance. It evaluates six models on two tasks using Swedish essays: computing the LIX readability score, and computing Average Dependency Distance (ADD) from a dependency parse. Model outputs are compared against ground-truth LIX values from a Python implementation and against Stanza-derived dependency trees, and the relationship between performance on these tasks and the Massive Multitask Language Understanding (MMLU) benchmark is analyzed via Pearson correlations. The study finds that o1-mini consistently performs best on both the LIX and ADD tasks, and that LIX error correlates strongly and negatively with MMLU score ($r = -0.875$, $p = 0.026$), suggesting that language-complexity abilities track general model proficiency, albeit noisily. The results support using complexity metrics as a practical, language-independent proxy for quick model evaluation, while limitations related to single runs, domain scope, model openness, and data size warrant cautious interpretation and further study. LIX is computed as $LIX = A/B + 100C/A$, where $A$ is the number of words, $B$ the number of sentences, and $C$ the number of long words (more than six letters).
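As a rough illustration of the formula $LIX = A/B + 100C/A$, the following Python sketch computes LIX with $A$ as the word count, $B$ as the sentence count, and $C$ as the count of words longer than six letters. It is not the implementation used in the paper. The function name `lix`, the regex-based tokenization, and the choice of `.`, `!`, and `?` as sentence terminators are assumptions made for this example.

```python
import re


def lix(text: str, long_word_threshold: int = 6) -> float:
    """Compute LIX = A/B + 100*C/A for a text.

    A = number of words, B = number of sentences,
    C = number of words longer than `long_word_threshold` letters.
    Tokenization and sentence splitting here are simplifying assumptions.
    """
    # Words: runs of Unicode letters (so Swedish å, ä, ö are handled).
    words = re.findall(r"[^\W\d_]+", text)
    # Sentences: non-empty chunks between runs of '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")

    a = len(words)
    b = len(sentences)
    c = sum(1 for w in words if len(w) > long_word_threshold)
    return a / b + 100 * c / a


if __name__ == "__main__":
    sample = "Jag läser en bok. Universitetet ligger i Stockholm."
    # 8 words, 2 sentences, 2 long words: 8/2 + 100*2/8
    print(lix(sample))
```

A deterministic reference of this kind is what makes the task usable as a zero-shot check: the model's reported score can be compared directly against the computed value.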
Abstract
Large Language Models (LLMs) have made significant strides in natural language generation but often face challenges in tasks requiring precise calculations and structural analysis. This paper investigates the performance of state-of-the-art LLMs on language complexity measurement tasks, namely computing the LIX readability metric and Average Dependency Distance (ADD). Using Swedish high school and university-level essays, we evaluate the models' abilities to compute LIX scores and perform dependency parsing, comparing their results to established ground truths. Our findings reveal that while all models demonstrate some capacity for these tasks, ChatGPT-o1-mini performs most consistently, achieving the highest accuracy in both LIX computation and dependency parsing. Additionally, we observe a strong, statistically significant negative correlation ($r = -0.875$, $p = 0.026$, $N = 6$) between the models' error in computing LIX and their overall performance on the Massive Multitask Language Understanding (MMLU) benchmark. These results suggest that language complexity measurement abilities can serve as noisy zero-shot proxies for assessing the general capabilities of LLMs, providing a practical method for model evaluation without the need for extensive benchmarking datasets.
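To make the second task concrete, the sketch below computes an Average Dependency Distance from a list of head indices, using the 1-based convention of CoNLL-U style parses in which a head of 0 marks the root. Excluding the root from the average and averaging over a single sentence are assumptions made for this example, and the paper's exact definition may differ. The function name `average_dependency_distance` and the example parse are illustrative only.

```python
from typing import Sequence


def average_dependency_distance(heads: Sequence[int]) -> float:
    """Mean absolute distance between each word and its syntactic head.

    `heads[i]` is the 1-based position of the head of word i+1,
    with 0 marking the root. The root is skipped (an assumption),
    since it has no head inside the sentence.
    """
    distances = [
        abs(position - head)
        for position, head in enumerate(heads, start=1)
        if head != 0
    ]
    if not distances:
        raise ValueError("sentence has no non-root dependencies")
    return sum(distances) / len(distances)


if __name__ == "__main__":
    # Hypothetical parse of "Jag läser en bok":
    # Jag -> läser, läser = root, en -> bok, bok -> läser
    heads = [2, 0, 4, 2]
    print(average_dependency_distance(heads))
```

In practice the head indices would come from a dependency parser; they are passed in directly here so that the example stays self-contained.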
