Enhancing Portuguese Variety Identification with Cross-Domain Approaches

Hugo Sousa, Rúben Almeida, Purificação Silvano, Inês Cantante, Ricardo Campos, Alípio Jorge

TL;DR

The paper tackles cross-domain language variety identification (LVI) for Portuguese, distinguishing European from Brazilian Portuguese to reduce Brazil-centric bias in NLP resources. It introduces PtBrVarId, a silver-labeled, multi-domain corpus built from 11 sources across six domains, and proposes a cross-domain training protocol that delexicalizes named entities and thematic content. A BERTimbau-based LVI system trained with this protocol outperforms a strong N-gram baseline and, when trained on delexicalized data, reaches $F_1$ scores of $84.97\%$ on DSL-TL and $77.25\%$ on FRMT, indicating improved generalization. The work provides a large open resource for Portuguese LVI and demonstrates a practical method for extending variety-aware NLP to additional languages or varieties.

Abstract

Recent advances in natural language processing have raised expectations for generative models to produce coherent text across diverse language varieties. In the particular case of the Portuguese language, the predominance of Brazilian Portuguese corpora online introduces linguistic biases in these models, limiting their applicability outside of Brazil. To address this gap and promote the creation of European Portuguese resources, we developed a cross-domain language variety identifier (LVI) to discriminate between European and Brazilian Portuguese. Motivated by the findings of our literature review, we compiled the PtBrVarId corpus, a cross-domain LVI dataset, and studied the effectiveness of transformer-based LVI classifiers in cross-domain scenarios. Although this research focuses on two Portuguese varieties, our contribution can be extended to other varieties and languages. We open-source the code, corpus, and models to foster further research on this task.

Paper Structure

This paper contains 19 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Average $F_1$ score for each ($P_\text{POS}$, $P_\text{NER}$).
  • Figure 2: $F_1$ in FRMT and DSL-TL benchmarks. Models with the subscript $d$ were trained on a delexicalized corpus.
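
The delexicalization mentioned in the TL;DR and in the figure captions can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that $P_\text{NER}$ and $P_\text{POS}$ are per-token replacement probabilities, that tokens arrive already tagged by some upstream tagger, and that "thematic content" means nouns, adjectives, and verbs. The function name `delexicalize`, the `CONTENT_POS` set, and the tag labels are all hypothetical.

```python
import random
from typing import List, Optional, Tuple

# A token is (surface form, POS tag, NER tag); the NER tag is None for non-entities.
# The tags are assumed to come from an upstream tagger that is not shown here.
Token = Tuple[str, str, Optional[str]]

# Assumption: these POS classes stand in for "thematic content".
CONTENT_POS = {"NOUN", "ADJ", "VERB"}


def delexicalize(
    tokens: List[Token], p_pos: float, p_ner: float, rng: random.Random
) -> List[str]:
    """Replace entity tokens by their NER tag with probability p_ner and
    content-word tokens by their POS tag with probability p_pos."""
    output = []
    for text, pos, ner in tokens:
        if ner is not None:
            # Named entity: mask it with its entity label.
            output.append(ner if rng.random() < p_ner else text)
        elif pos in CONTENT_POS:
            # Content word: mask it with its part-of-speech label.
            output.append(pos if rng.random() < p_pos else text)
        else:
            # Function words are kept, since they may carry variety cues.
            output.append(text)
    return output


# Hypothetical pre-tagged sentence: "O Hugo comprou um autocarro em Lisboa"
sample: List[Token] = [
    ("O", "DET", None),
    ("Hugo", "PROPN", "PER"),
    ("comprou", "VERB", None),
    ("um", "DET", None),
    ("autocarro", "NOUN", None),
    ("em", "ADP", None),
    ("Lisboa", "PROPN", "LOC"),
]

fully_masked = delexicalize(sample, p_pos=1.0, p_ner=1.0, rng=random.Random(0))
untouched = delexicalize(sample, p_pos=0.0, p_ner=0.0, rng=random.Random(0))
print(" ".join(fully_masked))
print(" ".join(untouched))
```

Under these assumptions, sweeping `p_pos` and `p_ner` over a grid and training one classifier per pair would yield the kind of ($P_\text{POS}$, $P_\text{NER}$) comparison that Figure 1 reports.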