Sabiá-2: A New Generation of Portuguese Large Language Models

Thales Sales Almeida; Hugo Abonizio; Rodrigo Nogueira; Ramon Pires

TL;DR

This paper introduces Sabiá-2, a family of large language models trained on Portuguese texts, and identifies math and coding as key abilities that need improvement.

Abstract

We introduce Sabiá-2, a family of large language models trained on Portuguese texts. The models are evaluated on a diverse range of exams, including entry-level tests for Brazilian universities, professional certification exams, and graduate-level exams for various disciplines such as accounting, economics, engineering, law and medicine. Our results reveal that our best model so far, Sabiá-2 Medium, matches or surpasses GPT-4's performance in 23 out of 64 exams and outperforms GPT-3.5 in 58 out of 64 exams. Notably, specialization has a significant impact on a model's performance without the need to increase its size, allowing us to offer Sabiá-2 Medium at a price per token that is 10 times cheaper than GPT-4. Finally, we identified that math and coding are key abilities that need improvement.

Paper Structure (15 sections, 7 figures, 5 tables)

Introduction
Academic and Professional Benchmarks
University Admission Exams
Undergraduate Exams
Professional Certification Exams
Brazilian Chat Evaluation
Capabilities on Academic Exams
Pricing versus Performance
Capabilities on Conversations
Limitations
Conclusion
Acknowledgments
Evaluating the impact of chat mode
Examples of BRACEval
Results on BRACEval per Category

Figures (7)

Figure 1: Results of Sabiá-2, GPT-3.5 Turbo and GPT-4 Turbo on Enade 2022 and 2023 exams, ordered from low to high based on Sabiá-2 performance. Sabiá-2 outperforms GPT-3.5 Turbo on most exams, the exceptions being Control and Automation Engineering, and Medicine. Its lowest accuracies were in domains related to engineering and economics, whereas GPT-4 Turbo shows consistently high accuracy across the spectrum.
Figure 2: Performance of Sabiá-2 and other proprietary LLMs on benchmarks of university admission exams: ENEM and BLUEX. The benchmark includes 3 exams.
Figure 3: Performance of Sabiá-2 and other proprietary LLMs on benchmarks of undergraduate exams: ENADE, POSCOMP, and MREX.
Figure 4: Performance of Sabiá-2 and other proprietary LLMs on benchmarks of professional certification exams: OAB, CFCES, and Revalida.
Figure 5: Pricing versus Performance of Sabiá-2 and other proprietary LLMs on Exams taken after mid-2023. Sabiá-2 Medium demonstrates superior performance compared to Mistral Large and Claude 3 Sonnet with a 3-point advantage while being significantly more cost-effective, priced at 4.5 times less than Claude 3 Sonnet and 8 times less than Mistral Large.
...and 2 more figures
