The Invalsi Benchmarks: measuring Linguistic and Mathematical understanding of Large Language Models in Italian
Giovanni Puccetti, Maria Cassese, Andrea Esuli
TL;DR
This paper presents three native Italian benchmarks (Invalsi MATE, Invalsi ITA, and Olimpiadi MATE) to evaluate the mathematical reasoning and language understanding of LLMs in Italian. It evaluates 10 models across these benchmarks and reports that multilingual pretraining generally yields strong Italian performance, while Italian-specific fine-tuning offers limited gains over it. Llama 3.1 variants lead on the mathematics benchmarks: Llama 3.1 70b instruct reaches 71% accuracy on Invalsi MATE, while Olimpiadi MATE remains highly challenging, with a top accuracy of 45% (Llama 3.1 405b instruct). Compared with the average performance of Italian students, most models outperform students on Invalsi ITA, but only Llama 3.1 does so on Invalsi MATE. Data and code are slated for open release to support reproducibility and ongoing benchmarking as the Italian LLM ecosystem evolves.
Abstract
While Italian is a high-resource language, there are few Italian-native benchmarks to evaluate generative Large Language Models (LLMs) in this language. This work presents three new benchmarks: Invalsi MATE, to evaluate models' mathematical understanding in Italian; Invalsi ITA, to evaluate language understanding in Italian; and Olimpiadi MATE, for more complex mathematical understanding. The first two benchmarks are based on the Invalsi tests, which are administered to students aged 6 to 18 within the Italian school system and have been validated by several experts in teaching and pedagogy. The third comes from the Italian high school math Olympiad. We evaluate 10 powerful language models on these benchmarks and find that their accuracy is capped at 71% on Invalsi MATE, achieved by Llama 3.1 70b instruct, and at 88% on Invalsi ITA. For both Invalsi MATE and Invalsi ITA, we compare LLMs with the average performance of Italian students and show that Llama 3.1 is the only model to outperform them on Invalsi MATE, while most models do so on Invalsi ITA. We then show that Olimpiadi MATE is more challenging than Invalsi MATE: the highest accuracy, achieved by Llama 3.1 405b instruct, is 45%. We will make data and evaluation code openly available upon acceptance of the paper.
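As an illustration of the comparison described above, the following is a minimal sketch of how a model's accuracy on a multiple-choice benchmark could be set against the average student performance on the same items. It is not the authors' evaluation code: the `Item` schema, its field names, and both helper functions are hypothetical, and the sketch assumes each item carries a gold option label, the model's chosen label, and the fraction of students who answered correctly.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One multiple-choice question (hypothetical schema, not the paper's)."""
    gold: str                # correct option label, e.g. "B"
    predicted: str           # option label chosen by the model
    student_accuracy: float  # fraction of students who answered correctly


def _normalize(label: str) -> str:
    # Compare labels case-insensitively and ignore surrounding whitespace.
    return label.strip().upper()


def model_accuracy(items: list[Item]) -> float:
    """Fraction of items where the model's choice matches the gold label."""
    if not items:
        raise ValueError("no items to score")
    correct = sum(_normalize(i.predicted) == _normalize(i.gold) for i in items)
    return correct / len(items)


def mean_student_accuracy(items: list[Item]) -> float:
    """Unweighted mean of the per-item student accuracy."""
    if not items:
        raise ValueError("no items to score")
    return sum(i.student_accuracy for i in items) / len(items)


def outperforms_students(items: list[Item]) -> bool:
    """True if the model's accuracy is strictly above the student average."""
    return model_accuracy(items) > mean_student_accuracy(items)
```

Averaging per-item student accuracy without weights is one possible design choice; a real comparison might instead weight items by the number of students who took each test.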
