Challenging the Abilities of Large Language Models in Italian: a Community Initiative

Malvina Nissim, Danilo Croce, Viviana Patti, Pierpaolo Basile, Giuseppe Attanasio, Elio Musacchio, Matteo Rinaldi, Federico Borazio, Maria Francis, Jacopo Gili, Daniel Scalena, Begoña Altuna, Ekhi Azurmendi, Valerio Basile, Luisa Bentivogli, Arianna Bisazza, Marianna Bolognesi, Dominique Brunato, Tommaso Caselli, Silvia Casola, Maria Cassese, Mauro Cettolo, Claudia Collacciani, Leonardo De Cosmo, Maria Pia Di Buono, Andrea Esuli, Julen Etxaniz, Chiara Ferrando, Alessia Fidelangeli, Simona Frenda, Achille Fusco, Marco Gaido, Andrea Galassi, Federico Galli, Luca Giordano, Mattia Goffetti, Itziar Gonzalez-Dios, Lorenzo Gregori, Giulia Grundler, Sandro Iannaccone, Chunyang Jiang, Moreno La Quatra, Francesca Lagioia, Soda Marem Lo, Marco Madeddu, Bernardo Magnini, Raffaele Manna, Fabio Mercorio, Paola Merlo, Arianna Muti, Vivi Nastase, Matteo Negri, Dario Onorati, Elena Palmieri, Sara Papi, Lucia Passaro, Giulia Pensa, Andrea Piergentili, Daniele Potertì, Giovanni Puccetti, Federico Ranaldi, Leonardo Ranaldi, Andrea Amelio Ravelli, Martina Rosola, Elena Sofia Ruzzetti, Giuseppe Samo, Andrea Santilli, Piera Santin, Gabriele Sarti, Giovanni Sartor, Beatrice Savoldi, Antonio Serino, Andrea Seveso, Lucia Siciliani, Paolo Torroni, Rossella Varvara, Andrea Zaninello, Asya Zanollo, Fabio Massimo Zanzotto, Kamyar Zeinalipour, Andrea Zugarini

TL;DR

CALAMITA addresses a core problem: evaluating Large Language Models on Italian with native data and community-driven curation rather than relying on translated benchmarks or leaderboards. The authors introduce a collaborative methodology that federates 80+ contributors to design, document, and evaluate a diverse set of tasks spanning linguistic competence, reasoning, factual knowledge, and fairness, using a centralized evaluation pipeline. Key contributions include a rolling, scalable benchmark with 22 challenges and ~100 subtasks, detailed error analyses across models, and methodological lessons on task-representative metrics and harmonized pipelines, offering a blueprint for other languages. Findings show that model size and broad pretraining diversity are major drivers of performance, but gains are task- and format-dependent; Italian-specific tuning yields limited universal advantages, emphasizing the need for careful task design and evaluation. CALAMITA aims to be a sustainable, community-driven framework that can evolve with the international landscape, guiding inclusive, rigorous evaluation practices for languages with limited resources.

Abstract

The rapid progress of Large Language Models (LLMs) has transformed natural language processing and broadened its impact across research and society. Yet, systematic evaluation of these models, especially for languages beyond English, remains limited. "Challenging the Abilities of LAnguage Models in ITAlian" (CALAMITA) is a large-scale collaborative benchmarking initiative for Italian, coordinated under the Italian Association for Computational Linguistics. Unlike existing efforts that focus on leaderboards, CALAMITA foregrounds methodology: it federates more than 80 contributors from academia, industry, and the public sector to design, document, and evaluate a diverse collection of tasks, covering linguistic competence, commonsense reasoning, factual consistency, fairness, summarization, translation, and code generation. Through this process, we not only assembled a benchmark of over 20 tasks and almost 100 subtasks, but also established a centralized evaluation pipeline that supports heterogeneous datasets and metrics. We report results for four open-weight LLMs, highlighting systematic strengths and weaknesses across abilities, as well as challenges in task-specific evaluation. Beyond quantitative results, CALAMITA exposes methodological lessons: the necessity of fine-grained, task-representative metrics, the importance of harmonized pipelines, and the benefits and limitations of broad community engagement. CALAMITA is conceived as a rolling benchmark, enabling continuous integration of new tasks and models. This makes it both a resource -- the most comprehensive and diverse benchmark for Italian to date -- and a framework for sustainable, community-driven evaluation. We argue that this combination offers a blueprint for other languages and communities seeking inclusive and rigorous LLM evaluation practices.

Paper Structure

This paper contains 101 sections, 9 figures, 32 tables.

Figures (9)

  • Figure 1: Gist template.
  • Figure 2: Response distributions for the BLM-It task across the three subtasks: (a) Agreement, (b) Causative, and (c) Object-drop.
  • Figure 3: Results for the ECWCA task. Distribution of error types across models in the hint and no-hint settings. Categories include verbose, typo, inversion, incomplete, and wrong entity. See Table \ref{tab:ecwca_error_examples} for some error type examples. Correct predictions were marked as correct.
  • Figure 4: Error rates for the ITA-SENSE multiple-choice subtask across different levels of word polysemy. Each bar corresponds to words with two, three, or more meanings; values in parentheses indicate the number of instances per group.
  • Figure 5: Answer overlap across models in the Mult-IT task. The top diagram shows the intersection of answers for all questions, while the bottom focuses on cases where models agreed on the correct option. Each area reflects the proportion of shared predictions among models.
  • ...and 4 more figures