GlórIA -- A Generative and Open Large Language Model for Portuguese

Ricardo Lopes, João Magalhães, David Semedo

TL;DR

GlórIA addresses the scarcity of European Portuguese (PT-PT) LLM resources with a decoder LLM trained on a 35.5B-token PT-PT corpus, and introduces CALAME-PT, a zero-shot language-modeling benchmark. The paper describes corpus assembly, the model architecture (1.3B and 2.7B parameters), a LLaMA-like data sampling strategy, and the pre-training setup. It reports that GlórIA outperforms existing open PT decoder models on language modeling and is competitive on discriminative tasks. It also compares GlórIA with PT encoder models, which remain strong baselines, and acknowledges limitations and the need for future work on scaling, data diversity, and multimodal extensions. Overall, GlórIA provides a PT-PT generative foundation for European Portuguese NLP and downstream applications.

Abstract

Significant strides have been made in natural language tasks, largely attributed to the emergence of powerful large language models (LLMs). These models, pre-trained on extensive and diverse corpora, have become increasingly capable of comprehending the intricacies of language. Despite the abundance of LLMs for many high-resource languages, the availability of such models remains limited for European Portuguese. We introduce GlórIA, a robust European Portuguese decoder LLM. To pre-train GlórIA, we assembled a comprehensive PT-PT text corpus comprising 35 billion tokens from various sources. We present our pre-training methodology, followed by an assessment of the model's effectiveness on multiple downstream tasks. Additionally, to evaluate our models' language modeling capabilities, we introduce CALAME-PT (Context-Aware LAnguage Modeling Evaluation for Portuguese), the first Portuguese zero-shot language-modeling benchmark. Evaluation shows that GlórIA significantly outperforms existing open PT decoder models in language modeling and that it can generate sound, knowledge-rich, and coherent PT-PT text. The model also exhibits strong potential for various downstream tasks.

Paper Structure (27 sections, 3 figures, 12 tables)

Figures (3)

  • Figure 1: GlórIA 1.3B pre-training loss and perplexity.
  • Figure 2: Overview of the creation process for CALAME-PT's generated set.
  • Figure 3: Evolution of GlórIA 1.3B performance on CALAME-PT, evaluated at three checkpoints (1M, 2M, and 3M steps) for both decoding strategies. EM denotes Exact-Match.
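
The two quantities named in the captions above, perplexity and Exact-Match (EM), can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. The perplexity function uses the standard definition (the exponential of the mean per-token cross-entropy, in nats). The EM function assumes each benchmark sample pairs one predicted string with one reference string, and its normalization (trimming whitespace and case-folding) is a hypothetical choice made only for illustration. The function names are also hypothetical.

```python
import math


def perplexity(token_losses):
    """Perplexity = exp(mean per-token cross-entropy), losses in nats."""
    if not token_losses:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_losses) / len(token_losses))


def exact_match(predictions, targets):
    """Fraction of predictions that equal their target string.

    Assumption (not from the paper): strings are compared after
    trimming surrounding whitespace and case-folding.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have equal length")
    if not targets:
        return 0.0
    hits = sum(
        p.strip().casefold() == t.strip().casefold()
        for p, t in zip(predictions, targets)
    )
    return hits / len(targets)


if __name__ == "__main__":
    # A uniform guess over 4 tokens has loss ln(4) per token.
    print(perplexity([math.log(4)] * 3))
    print(exact_match(["Lisboa ", "porto", "faro"], ["lisboa", "Porto", "Braga"]))
```

Because perplexity is a monotonic function of the mean loss, the two curves in a plot such as Figure 1 carry the same information on different scales.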