The Invalsi Benchmarks: measuring Linguistic and Mathematical understanding of Large Language Models in Italian

Giovanni Puccetti; Maria Cassese; Andrea Esuli

TL;DR

This paper presents three native Italian benchmarks (Invalsi MATE, Invalsi ITA, and Olimpiadi MATE) to evaluate the mathematical reasoning and language understanding of LLMs in Italian. It evaluates 10 models across these benchmarks, highlighting that multilingual pretraining generally yields strong Italian performance and that Italian-specific fine-tuning offers limited gains relative to multilingual pretraining. The study finds that Llama 3.1 variants often lead performance on MATE-type tasks, while Olimpiadi MATE remains highly challenging, with top results around 45% accuracy. By comparing model performance to that of Italian students, the work shows that current LLMs can outperform students on Italian-language tests and sometimes match or exceed human performance on math-oriented Italian benchmarks, underscoring both the value and the limitations of LLMs for non-English, education-focused evaluation. Data and code are slated for open release to support reproducibility and ongoing benchmarking as the Italian LLM ecosystem evolves.

Abstract

While Italian is a high-resource language, there are few Italian-native benchmarks to evaluate generative Large Language Models (LLMs) in this language. This work presents three new benchmarks: Invalsi MATE, to evaluate models' performance on mathematical understanding in Italian; Invalsi ITA, to evaluate language understanding in Italian; and Olimpiadi MATE, for more complex mathematical understanding. The first two benchmarks are based on the Invalsi tests, which are administered to students aged between 6 and 18 within the Italian school system and have been validated by several experts in teaching and pedagogy. The third comes from the Italian high school math Olympics. We evaluate 10 powerful language models on these benchmarks and find that accuracy tops out at 71% on Invalsi MATE, achieved by Llama 3.1 70b instruct, and at 88% on Invalsi ITA. For both Invalsi MATE and Invalsi ITA, we compare LLMs with the average performance of Italian students and show that Llama 3.1 is the only model to outperform them on Invalsi MATE, while most models do so on Invalsi ITA. We then show that Olimpiadi MATE is more challenging than Invalsi MATE: the highest accuracy, achieved by Llama 3.1 405b instruct, is 45%. We will make data and evaluation code openly available upon acceptance of the paper.
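The comparison the abstract describes (model accuracy against the average performance of students) can be illustrated with a minimal sketch. This is not the authors' evaluation code, which had not been released when the abstract was written. The `Item` schema, the function names, and the toy data below are all assumptions made for illustration, and the sketch covers only multiple-choice items scored by exact match on the option label.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One multiple-choice question (hypothetical schema, not the paper's format)."""
    gold: str                # correct option label, e.g. "A"
    prediction: str          # option label produced by the model
    student_accuracy: float  # fraction of students who answered this item correctly


def _normalize(label: str) -> str:
    """Compare option labels case-insensitively and ignore surrounding whitespace."""
    return label.strip().upper()


def accuracy(items):
    """Fraction of items where the model's answer matches the gold label."""
    if not items:
        raise ValueError("no items to score")
    correct = sum(1 for it in items if _normalize(it.prediction) == _normalize(it.gold))
    return correct / len(items)


def mean_student_accuracy(items):
    """Average of the per-item student accuracies (the human baseline)."""
    if not items:
        raise ValueError("no items to score")
    return sum(it.student_accuracy for it in items) / len(items)


def outperforms_students(items):
    """True when the model's accuracy is strictly above the student average."""
    return accuracy(items) > mean_student_accuracy(items)


# Toy placeholder data, invented for illustration only (not benchmark results).
toy_items = [
    Item(gold="A", prediction="A", student_accuracy=0.50),
    Item(gold="B", prediction="c", student_accuracy=0.75),
    Item(gold="C", prediction="C", student_accuracy=0.25),
    Item(gold="D", prediction="d", student_accuracy=0.50),
]

if __name__ == "__main__":
    print(f"model accuracy:   {accuracy(toy_items):.2f}")
    print(f"student average:  {mean_student_accuracy(toy_items):.2f}")
    print(f"model > students: {outperforms_students(toy_items)}")
```

Averaging per-item student accuracy is one plausible way to build the human baseline. The paper may aggregate student results differently, for example per grade.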

Paper Structure (28 sections, 7 figures, 6 tables)

Introduction
Related work
Benchmarks
Benchmark Description
Invalsi MATE
Invalsi ITA
Olimpiadi MATE
Distribution by Grade
Evaluation
Models
English pre-trained and Italian fine-tuned
Multilingual pre-trained and Multilingual fine-tuned
English pre-trained and English fine-tuned
Italian pre-trained
Results on Invalsi MATE
...and 13 more sections

Figures (7)

Figure 1: We show that LLMs perform better than human students on Mathematical and Language understanding in Italian.
Figure 2: The distribution of Question types in Invalsi MATE and Invalsi ITA.
Figure 3: The distribution of Question types in Invalsi MATE and Invalsi ITA.
Figure 4: The performance stratified by grade, shown in separate subfigures for Invalsi MATE and for Invalsi ITA.
Figure 5: Performance of different language models on Invalsi MATE per grade level, as assessed by humans.
...and 2 more figures
