
Portuguese Named Entity Recognition using BERT-CRF

Fábio Souza, Rodrigo Nogueira, Roberto Lotufo

TL;DR

This work addresses Portuguese Named Entity Recognition under data-scarce conditions by pre-training Portuguese BERT models on the large unlabeled brWaC corpus and applying a BERT-CRF architecture. It compares feature-based and fine-tuning transfer-learning strategies and uses document-context spans to capture longer dependencies. The fine-tuned BERT-CRF model achieves state-of-the-art results on the HAREM I dataset, improving F1 by about 1 point in the selective scenario and 4 points in the total scenario over the prior BiLSTM-CRF+FlairBBP approach. The study demonstrates the effectiveness of monolingual Portuguese BERT models for NER and provides publicly available resources to support reproducibility and further research in Portuguese NLP.

Abstract

Recent advances in language representation using neural networks have made it viable to transfer the learned internal states of a trained model to downstream natural language processing tasks, such as named entity recognition (NER) and question answering. It has been shown that leveraging pre-trained language models improves overall performance on many tasks and is highly beneficial when labeled data is scarce. In this work, we train Portuguese BERT models and apply a BERT-CRF architecture to the NER task in Portuguese, combining the transfer capabilities of BERT with the structured predictions of CRF. We explore feature-based and fine-tuning training strategies for the BERT model. Our fine-tuning approach obtains new state-of-the-art results on the HAREM I dataset, improving the F1-score by 1 point on the selective scenario (5 NE classes) and by 4 points on the total scenario (10 NE classes).

Paper Structure

This paper contains 21 sections, 2 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Illustration of the proposed method. Given an input document, the text is tokenized using WordPiece (Wu et al., 2016) and the tokenized document is split into overlapping spans of the maximum length using a defined stride (with a stride of 3 in the example). Maximum context tokens of each span are marked in bold. The spans are fed into BERT and then into the classification layer, producing a sequence of tag scores for each span. The sub-token entries (starting with ##) are removed from the spans and the remaining tokens are passed to the CRF layer. The maximum context tokens are selected and concatenated to form the final predicted tags.
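
The span splitting and maximum-context selection described in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, and the caption does not define how "maximum context" is scored, so the sketch assumes a common heuristic (the smaller of a token's left and right context within a span, with a small bonus for longer spans). BERT, WordPiece tokenization, sub-token removal, and the CRF layer are all omitted; the sketch only shows how per-span tag sequences could be merged into one sequence for the whole document.

```python
from typing import List, Sequence, Tuple


def split_into_spans(n_tokens: int, max_len: int, stride: int) -> List[Tuple[int, int]]:
    """Return (start, end) offsets of overlapping spans that cover n_tokens tokens."""
    if max_len < 1 or stride < 1:
        raise ValueError("max_len and stride must be positive")
    spans = []
    start = 0
    while True:
        end = min(start + max_len, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break
        start += stride
    return spans


def max_context_span(position: int, spans: Sequence[Tuple[int, int]]) -> int:
    """Index of the span in which the token at `position` has the most context.

    ASSUMPTION: the score is min(left context, right context) plus a small
    bonus for span length; the source caption does not specify a rule.
    """
    best_idx, best_score = -1, -1.0
    for idx, (start, end) in enumerate(spans):
        if not (start <= position < end):
            continue
        left = position - start
        right = end - 1 - position
        score = min(left, right) + 0.01 * (end - start)
        if score > best_score:  # ties keep the earlier span
            best_idx, best_score = idx, score
    return best_idx


def merge_span_tags(
    span_tags: Sequence[Sequence[str]],
    spans: Sequence[Tuple[int, int]],
    n_tokens: int,
) -> List[str]:
    """Keep, for each token, the tag predicted in its maximum-context span."""
    merged = []
    for pos in range(n_tokens):
        idx = max_context_span(pos, spans)
        start, _ = spans[idx]
        merged.append(span_tags[idx][pos - start])
    return merged


if __name__ == "__main__":
    # Toy document of 10 tokens, spans of at most 6 tokens, stride of 3.
    spans = split_into_spans(n_tokens=10, max_len=6, stride=3)
    # Placeholder "predictions": each tag records which span produced it.
    span_tags = [[f"span{i}" for _ in range(s, e)] for i, (s, e) in enumerate(spans)]
    print(spans)
    print(merge_span_tags(span_tags, spans, 10))
```

With these toy settings each token appears in up to two spans, and the merge step keeps exactly one prediction per token, so the final tag sequence has the same length as the document.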