Sabiá-2: A New Generation of Portuguese Large Language Models

Thales Sales Almeida, Hugo Abonizio, Rodrigo Nogueira, Ramon Pires

TL;DR

Sabiá-2, a family of large language models trained on Portuguese texts, is introduced; math and coding are identified as key abilities that need improvement.

Abstract

We introduce Sabiá-2, a family of large language models trained on Portuguese texts. The models are evaluated on a diverse range of exams, including entry-level tests for Brazilian universities, professional certification exams, and graduate-level exams for various disciplines such as accounting, economics, engineering, law and medicine. Our results reveal that our best model so far, Sabiá-2 Medium, matches or surpasses GPT-4's performance in 23 out of 64 exams and outperforms GPT-3.5 in 58 out of 64 exams. Notably, specialization has a significant impact on a model's performance without the need to increase its size, allowing us to offer Sabiá-2 Medium at a price per token that is 10 times cheaper than GPT-4. Finally, we identified that math and coding are key abilities that need improvement.

Paper Structure (15 sections, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Results of Sabiá-2, GPT-3.5 Turbo and GPT-4 Turbo on Enade 2022 and 2023 exams, ordered from low to high based on Sabiá-2 performance. Sabiá-2 outperforms GPT-3.5 Turbo on most exams, except Control and Automation Engineering, and Medicine. Sabiá-2's lowest accuracies were in domains related to engineering and economics. GPT-4 Turbo, however, demonstrates consistently high accuracy across the spectrum.
  • Figure 2: Performance of Sabiá-2 and other proprietary LLMs on benchmarks of university admission exams: ENEM and BLUEX. The benchmark includes 3 exams.
  • Figure 3: Performance of Sabiá-2 and other proprietary LLMs on benchmarks of undergraduate exams: ENADE, POSCOMP, and MREX.
  • Figure 4: Performance of Sabiá-2 and other proprietary LLMs on benchmarks of professional certification exams: OAB, CFCES, and Revalida.
  • Figure 5: Pricing versus performance of Sabiá-2 and other proprietary LLMs on exams taken after mid-2023. Sabiá-2 Medium outperforms Mistral Large and Claude 3 Sonnet by 3 points while being significantly more cost-effective: its price is 4.5 times lower than that of Claude 3 Sonnet and 8 times lower than that of Mistral Large.
  • ...and 2 more figures