Evalita-LLM: Benchmarking Large Language Models on Italian

Bernardo Magnini; Roberto Zanoli; Michele Resta; Martin Cimmino; Paolo Albano; Marco Madeddu; Viviana Patti

TL;DR

Evalita-LLM introduces a native-Italian benchmark for evaluating Large Language Models, with all tasks conducted entirely in Italian to avoid translation artifacts and cultural biases. The framework supports both traditional multiple-choice and generative prompting, and it is built with an iterative development method that validates candidate tasks and prompts against a set of development LLMs. By using the lm-evaluation-harness workflow and multiple prompt templates per task, Evalita-LLM reduces the dependence of results on any single prompt across ten diverse tasks: WiC, TE, SA, HS, FAQ, AT, LS, NER, REL, and SUM. Experimental results on the development LLMs show that the tasks vary meaningfully in difficulty and that models are sensitive to prompt wording, which highlights the importance of prompting strategies. The benchmark is publicly accessible on Hugging Face. Overall, Evalita-LLM advances Italian NLP evaluation by offering native datasets, flexible prompting, and metrics that gauge both peak and robust model performance in realistic settings.

Abstract

We describe Evalita-LLM, a new benchmark designed to evaluate Large Language Models (LLMs) on Italian tasks. The distinguishing and innovative features of Evalita-LLM are the following: (i) all tasks are native Italian, avoiding translation issues and potential cultural biases; (ii) in addition to well-established multiple-choice tasks, the benchmark includes generative tasks, enabling more natural interaction with LLMs; (iii) all tasks are evaluated against multiple prompts, thereby mitigating model sensitivity to specific prompts and allowing a fairer and more objective evaluation. We propose an iterative methodology in which candidate tasks and candidate prompts are validated against a set of LLMs used for development. We report experimental results from the benchmark's development phase and provide performance statistics for several state-of-the-art LLMs.
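
To make the idea of multi-prompt evaluation concrete, the following is a minimal sketch of how per-prompt scores for one model on one task might be summarized into a peak value and a prompt-averaged value. It is not the benchmark's own scoring code: the function name `summarize_prompt_scores`, the prompt identifiers, and the accuracy values are all hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def summarize_prompt_scores(scores_by_prompt):
    """Summarize one model's scores on one task across several prompts.

    `scores_by_prompt` maps a prompt identifier to a score in [0, 1].
    This helper is an illustrative assumption, not part of Evalita-LLM.
    """
    if not scores_by_prompt:
        raise ValueError("at least one prompt score is required")

    values = list(scores_by_prompt.values())
    # The prompt that yields the highest score gives the "peak" performance.
    best_prompt = max(scores_by_prompt, key=scores_by_prompt.get)

    return {
        "best_prompt": best_prompt,
        "peak": scores_by_prompt[best_prompt],
        # Averaging over prompts gives a view less tied to one wording.
        "mean": mean(values),
        # Population standard deviation as a rough prompt-sensitivity signal.
        "spread": pstdev(values),
    }


# Hypothetical accuracies for three prompt templates on a single task.
example_scores = {"p1": 0.62, "p2": 0.70, "p3": 0.66}
summary = summarize_prompt_scores(example_scores)
print(summary)
```

A large gap between `peak` and `mean`, or a large `spread`, would suggest that a model's result on the task depends heavily on how the prompt is worded.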
