Table of Contents
Fetching ...

Efficient Fine-Tuning Methods for Portuguese Question Answering: A Comparative Study of PEFT on BERTimbau and Exploratory Evaluation of Generative LLMs

Mariela M. Nina, Caio Veloso Costa, Lilian Berton, Didier A. Vega-Oliveros

Abstract

Although large language models have transformed natural language processing, their computational costs create accessibility barriers for low-resource languages such as Brazilian Portuguese. This work presents a systematic evaluation of Parameter-Efficient Fine-Tuning (PEFT) and quantization techniques applied to BERTimbau for Question Answering on SQuAD-BR, the Brazilian Portuguese translation of SQuAD v1. We evaluate 40 configurations combining four PEFT methods (LoRA, DoRA, QLoRA, QDoRA) across two model sizes (Base: 110M, Large: 335M parameters). Our findings reveal three critical insights: (1) LoRA achieves 95.8\% of baseline performance on BERTimbau-Large while reducing training time by 73.5\% (F1=81.32 vs 84.86); (2) higher learning rates (2e-4) substantially improve PEFT performance, with F1 gains of up to +19.71 points over standard rates; and (3) larger models show twice the quantization resilience (loss of 4.83 vs 9.56 F1 points). These results demonstrate that encoder-based models can be efficiently fine-tuned for extractive Brazilian Portuguese QA with substantially lower computational cost than large generative LLMs, promoting more sustainable approaches aligned with \textit{Green AI} principles. An exploratory evaluation of Tucano and Sabiá on the same extractive QA benchmark shows that while generative models can reach competitive F1 scores with LoRA fine-tuning, they require up to 4.2$\times$ more GPU memory and 3$\times$ more training time than BERTimbau-Base, reinforcing the efficiency advantage of smaller encoder-based architectures for this task.

Efficient Fine-Tuning Methods for Portuguese Question Answering: A Comparative Study of PEFT on BERTimbau and Exploratory Evaluation of Generative LLMs

Abstract

Although large language models have transformed natural language processing, their computational costs create accessibility barriers for low-resource languages such as Brazilian Portuguese. This work presents a systematic evaluation of Parameter-Efficient Fine-Tuning (PEFT) and quantization techniques applied to BERTimbau for Question Answering on SQuAD-BR, the Brazilian Portuguese translation of SQuAD v1. We evaluate 40 configurations combining four PEFT methods (LoRA, DoRA, QLoRA, QDoRA) across two model sizes (Base: 110M, Large: 335M parameters). Our findings reveal three critical insights: (1) LoRA achieves 95.8\% of baseline performance on BERTimbau-Large while reducing training time by 73.5\% (F1=81.32 vs 84.86); (2) higher learning rates (2e-4) substantially improve PEFT performance, with F1 gains of up to +19.71 points over standard rates; and (3) larger models show twice the quantization resilience (loss of 4.83 vs 9.56 F1 points). These results demonstrate that encoder-based models can be efficiently fine-tuned for extractive Brazilian Portuguese QA with substantially lower computational cost than large generative LLMs, promoting more sustainable approaches aligned with \textit{Green AI} principles. An exploratory evaluation of Tucano and Sabiá on the same extractive QA benchmark shows that while generative models can reach competitive F1 scores with LoRA fine-tuning, they require up to 4.2 more GPU memory and 3 more training time than BERTimbau-Base, reinforcing the efficiency advantage of smaller encoder-based architectures for this task.
Paper Structure (20 sections, 3 equations, 2 figures, 6 tables)

This paper contains 20 sections, 3 equations, 2 figures, 6 tables.

Figures (2)

  • Figure 1: F1 score for all PEFT methods across all configurations (learning rate $\times$ epochs $\times$ architecture). A bold border remarks the single best overall result per column; gray dashed borders show the best PEFT method per column. When both criteria coincide (i.e., a PEFT method is also the best overall), they are shown together. Values in parentheses indicate training collapse of full fine-tuning (Full FT) on BERTimbau-Large under high learning rate (lr=$2\times10^{-4}$), observed across both 2 and 3 epoch settings.
  • Figure 2: Peak GPU memory consumption per method and architecture. Quantized methods (QLoRA, QDoRA) reduce memory by up to 86.9% for Base and 81.9% for Large relative to Full FT. Memory values for generative models (Tucano-1B, Sabiá-7B) are included for cross-architecture comparison; results for those models are discussed in Section \ref{['sec:generative']}. The dashed red line marks the 20 GB GPU limit of the hardware used.