CLARIN-PT-LDB: An Open LLM Leaderboard for Portuguese to assess Language, Culture and Civility

João Silva, Luís Gomes, António Branco

Abstract

This paper reports on the development of a leaderboard of Open Large Language Models (LLMs) for European Portuguese (PT-PT) and on its associated benchmarks. The leaderboard addresses a gap in the evaluation of LLMs for European Portuguese, which until now had no leaderboard dedicated to this variant of the language. The paper also reports on novel benchmarks, including some that address aspects of performance not previously covered by benchmarks for European Portuguese, namely model safeguards and alignment to Portuguese culture. The leaderboard is available at https://huggingface.co/spaces/PORTULAN/portuguese-llm-leaderboard.

Paper Structure (12 sections, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Prompt template (abridged, and translated into English for the sake of readability) for the judge model that evaluates model answers on the DoNotAnswer-PT benchmark. The few-shot examples are omitted for readability. Typeset in italics are the names of the fields of the dataset that will be substituted in, forming the final prompt.
  • Figure 2: Prompt template for GPQA Diamond (translated into English for the sake of readability). Typeset in italics are the names of the fields of the dataset that will be substituted in, forming the final prompt.
  • Figure 3: Screenshot of the main leaderboard page. The top right has links to the "About" page and to the page where evaluation requests are submitted. The various feature selection and filtering buttons are automatically populated depending on the existing results.