Cabrita: closing the gap for foreign languages

Celio Larcher, Marcos Piau, Paulo Finardi, Pedro Gengo, Piero Esposito, Vinicius Caridá

TL;DR

Cabrita offers a low-cost path to language-specific model proficiency by coupling tokenizer adaptation with continual pre-training of OpenLLaMA on Portuguese text. The openCabrita models achieve substantial tokenization efficiency gains and competitive results on Portuguese benchmarks at 3B parameters, approaching larger language-specific systems while reducing inference costs. The approach demonstrates that strategic tokenizer redesign paired with targeted pre-training can close language gaps without prohibitive resource demands, with promising potential for other underrepresented languages. Future work aims to scale to larger models, cover more languages, and establish broader benchmarks.

Abstract

Training a model from scratch in a specific language or domain serves two essential purposes: i) enhancing performance in that linguistic or domain context, and ii) ensuring effective tokenization. The main limitation of this approach is its cost, which can reach six- to seven-digit dollar values depending on the model size. The usual way to avoid this cost is to rely on available pre-trained models. Despite recent advances such as the LLaMA and LLaMA-2 models, however, these models remain inefficient for certain domain-specific problems and ineffective in scenarios that rely on conversational memory, given the large number of tokens they require to represent text. To overcome this issue, we present a methodology named Cabrita which, as our research demonstrates, addresses both the performance and the tokenization-efficiency problems at an affordable cost. We believe that this methodology can be applied to any transformer-like architecture. To validate the study, we conducted continuous pre-training exclusively on Portuguese text, starting from a 3-billion-parameter model known as OpenLLaMA, resulting in a model named openCabrita 3B. openCabrita 3B also features a new tokenizer that significantly reduces the number of tokens required to represent the text. In our assessment on few-shot learning tasks, this 3B model achieved results similar to those of a traditional continuous pre-training approach and to those of 7B English pre-trained models.

Paper Structure (15 sections, 1 figure, 7 tables)


Figures (1)

  • Figure 1: Tokenizer efficiency: the X axis shows the number of tokens required to represent 7400 words of the Constitutional law of the USA in English, while the Y axis shows the number of tokens required to represent a Portuguese translation of the same text. The size of each sphere represents the vocabulary size of the corresponding tokenizer.
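
The comparison behind Figure 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' evaluation code: the helper names (`tokens_per_word`, `compare`), the two toy tokenizers, and the sample sentences are all hypothetical stand-ins. In practice each toy function would be replaced by the tokenizer of a model under study, and the sample sentences by the full English and Portuguese texts.

```python
from typing import Callable, Dict, List, Tuple

# Any function that maps a string to a list of tokens can be plugged in.
Tokenizer = Callable[[str], List[str]]


def tokens_per_word(text: str, tokenize: Tokenizer) -> float:
    """Average number of tokens needed per whitespace-separated word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)


def compare(
    tokenizers: Dict[str, Tokenizer], english: str, portuguese: str
) -> Dict[str, Tuple[int, int]]:
    """For each tokenizer, return (English token count, Portuguese token count).

    Each pair corresponds to one point in a plot like Figure 1.
    """
    return {
        name: (len(tok(english)), len(tok(portuguese)))
        for name, tok in tokenizers.items()
    }


# Toy stand-ins (assumptions for illustration only, not real model tokenizers).
def whitespace_tokenizer(text: str) -> List[str]:
    """One token per whitespace-separated word."""
    return text.split()


def char_tokenizer(text: str) -> List[str]:
    """One token per non-whitespace character."""
    return [c for c in text if not c.isspace()]


# Short sample sentences standing in for the full parallel texts.
english = "We the People of the United States"
portuguese = "Nós, o Povo dos Estados Unidos"

points = compare(
    {"whitespace": whitespace_tokenizer, "char": char_tokenizer},
    english,
    portuguese,
)

for name, (en_count, pt_count) in points.items():
    print(f"{name}: English={en_count} tokens, Portuguese={pt_count} tokens")
```

A tokenizer whose Portuguese count is much higher than its English count for the same content spends more tokens per word on Portuguese, which is the inefficiency the new openCabrita tokenizer is meant to reduce.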