Disce aut Deficere: Evaluating LLMs Proficiency on the INVALSI Italian Benchmark

Fabio Mercorio, Mario Mezzanzanica, Daniele Potertì, Antonio Serino, Andrea Seveso

Abstract

Recent advancements in Large Language Models (LLMs) have significantly enhanced their ability to generate and manipulate human language, highlighting their potential across various applications. Evaluating LLMs in languages other than English is crucial for ensuring their linguistic versatility, cultural relevance, and applicability in diverse global contexts, thus broadening their usability and effectiveness. We tackle this challenge by introducing a structured benchmark based on the INVALSI tests, a set of well-established assessments designed to measure educational competencies across Italy. Our study makes three primary contributions: first, we adapt the INVALSI benchmark for automated LLM evaluation, rigorously reworking the test format to suit automated processing while retaining the essence of the original tests; second, we provide a detailed assessment of current LLMs, offering a crucial reference point for the academic community; finally, we visually compare the performance of these models against human results. Researchers are also invited to submit their models for ongoing evaluation, ensuring the benchmark remains a current and valuable resource.
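The abstract describes reworking the test format for automated processing only at a high level. As a rough illustration of what such a harness can look like, the sketch below renders a multiple-choice item as a plain-text prompt and scores completions by exact match of the extracted answer letter against the key. This is a minimal, hypothetical sketch, not the authors' actual pipeline: the `Item` fields, the prompt template, and the regex-based letter extraction are all assumptions.

```python
import re
from dataclasses import dataclass

@dataclass
class Item:
    question: str
    options: dict[str, str]   # e.g. {"A": "...", "B": "..."}
    answer: str               # gold key, e.g. "B"

def build_prompt(item: Item) -> str:
    """Render one multiple-choice item as a plain-text prompt (Italian instructions)."""
    lines = [item.question] + [f"{k}. {v}" for k, v in sorted(item.options.items())]
    lines.append("Rispondi solo con la lettera dell'opzione corretta.")
    return "\n".join(lines)

def extract_choice(completion: str) -> str | None:
    """Pull the first standalone option letter out of the model's completion."""
    m = re.search(r"\b([A-D])\b", completion.strip().upper())
    return m.group(1) if m else None

def accuracy(items: list[Item], completions: list[str]) -> float:
    """Exact-match accuracy of extracted letters against the answer key."""
    hits = sum(extract_choice(c) == it.answer for it, c in zip(items, completions))
    return hits / len(items)

# Example with one invented item and one invented completion.
item = Item("Quale parola è un sinonimo di 'rapido'?",
            {"A": "lento", "B": "veloce", "C": "fragile"}, "B")
print(build_prompt(item))
print(accuracy([item], ["B. veloce"]))  # -> 1.0
```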

Paper Structure

This paper contains 35 sections, 3 figures, and 7 tables.

Figures (3)

  • Figure 1: Visualising the accuracy of various models across different school grades. Each layer represents a different grade level, from 2nd grade in primary school to 13th grade in high school, showing the distribution of performance accuracy for each grade.
  • Figure 2: Distribution of accuracy scores of language models categorised by size: small, medium, and large. Each plot represents the distribution of accuracy scores within each category, with individual data points highlighted, each representing a test taken by a model, and the mean accuracy marked by a horizontal line.
  • Figure 3: Scatter plot visualising the accuracy of both human respondents and language models on various tests across different grade levels (see the plotting sketch after this list). The red lines mark the median accuracy of human answers at 59.8%. The graph is divided into four quadrants to categorise performance: the top-right quadrant ("Both"), where both humans and models perform well; the top-left quadrant ("Humans"), where humans outperform models; the bottom-right quadrant ("GenAI"), where models outperform humans; and the bottom-left quadrant ("Neither"), where neither models nor humans perform well. Each symbol represents the average performance for each model size on a test, and colour coding corresponds to the educational grade level, providing an overview of where AI competes with or lags behind human performance. Multiple data points with the same colour and symbol appear wherever multiple tests exist for the same school grade.
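To make the quadrant construction in Figure 3 concrete, here is a minimal matplotlib sketch under stated assumptions: the data points are invented placeholders, the axis orientation (models on x, humans on y) is inferred from the caption's quadrant labels rather than taken from the paper, and placing the model-axis threshold at the same 59.8% as the reported human median is an assumption.

```python
import matplotlib.pyplot as plt

# Invented placeholder per-test averages: (model accuracy %, human accuracy %).
points = [(45, 72), (70, 85), (63, 40), (50, 30), (80, 66)]

HUMAN_MEDIAN = 59.8   # median human accuracy reported in the Figure 3 caption
MODEL_CUTOFF = 59.8   # assumed: same threshold reused on the model axis

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(*zip(*points), color="tab:blue")

# Red threshold lines split the plane into the four quadrants of Figure 3.
ax.axvline(MODEL_CUTOFF, color="red", linewidth=1)
ax.axhline(HUMAN_MEDIAN, color="red", linewidth=1)

# Quadrant labels, matching the caption's naming.
ax.text(95, 95, "Both", ha="right", va="top")
ax.text(5, 95, "Humans", ha="left", va="top")
ax.text(95, 5, "GenAI", ha="right", va="bottom")
ax.text(5, 5, "Neither", ha="left", va="bottom")

ax.set_xlim(0, 100)
ax.set_ylim(0, 100)
ax.set_xlabel("Model accuracy (%)")
ax.set_ylabel("Human accuracy (%)")
ax.set_title("Quadrant view of model vs. human accuracy")
plt.show()
```

In the paper's figure, each marker additionally encodes model size (symbol) and school grade (colour); the sketch omits that encoding to keep the quadrant geometry readable.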