Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT

Shijie Wu, Mark Dredze

TL;DR

This paper evaluates multilingual BERT (mBERT) as a zero-shot cross-lingual transfer model across five NLP tasks and 39 languages, showing competitive or state-of-the-art performance without explicit cross-lingual signals. It analyzes how to best fine-tune or repurpose mBERT (e.g., freezing bottom layers) and reveals that language-specific information persists across layers, while subword sharing across languages strongly correlates with transfer success. The findings indicate mBERT is a robust foundation for multilingual NLP, with potential gains from limited target-language supervision and cross-lingual signals, and point to future work on weak supervision and subword-based transfer strategies. Overall, mBERT demonstrates surprisingly strong cross-lingual generalization, outperforming traditional cross-lingual embeddings on several tasks and languages. The work informs practical deployment choices and future research on multilingual representation learning.

Abstract

Pretrained contextual representation models (Peters et al., 2018; Devlin et al., 2018) have pushed forward the state of the art on many NLP tasks. A new release of BERT (Devlin, 2018) includes a model simultaneously pretrained on 104 languages with impressive performance for zero-shot cross-lingual transfer on a natural language inference task. This paper explores the broader cross-lingual potential of multilingual BERT (mBERT) as a zero-shot language transfer model on 5 NLP tasks covering a total of 39 languages from various language families: NLI, document classification, NER, POS tagging, and dependency parsing. We compare mBERT with the best published methods for zero-shot cross-lingual transfer and find mBERT competitive on each task. Additionally, we investigate the most effective strategy for utilizing mBERT in this manner, determine to what extent mBERT generalizes away from language-specific features, and measure factors that influence cross-lingual transfer.

Paper Structure

This paper contains 31 sections, 5 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Performance of different fine-tuning approaches compared with fine-tuning all mBERT parameters. Color denotes the absolute difference, and the number in each entry is the evaluation score in the corresponding setting. Languages are sorted by mBERT zero-shot transfer performance. Three downward triangles indicate a performance drop larger than the legend's lower limit.
  • Figure 2: Language identification accuracy for each layer of mBERT. Layer 0 is the embedding layer, and layer $i > 0$ is the output of the $i^\text{th}$ transformer block.
  • Figure 3: Relationship between mBERT's cross-lingual zero-shot transfer performance and the percentage of observed subwords at both the type level and the token level. The Pearson correlation coefficient and $p$-value are shown in red.
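
The quantities named in the Figure 3 caption can be illustrated with a minimal sketch. It assumes that an "observed" subword is a target-language subword that also occurs in the source-language data, which is one plausible reading of the caption rather than the paper's confirmed definition. The function names `observed_subword_pct` and `pearson_r` and the toy token lists are hypothetical, and the sketch does not compute the $p$-value.

```python
from collections import Counter
from math import sqrt


def observed_subword_pct(source_tokens, target_tokens):
    """Return (type_level, token_level) percentages of target subwords
    that also occur in the source data.

    Assumption: "observed" means "present in the source subword set".
    """
    source_vocab = set(source_tokens)
    target_counts = Counter(target_tokens)
    if not target_counts:
        raise ValueError("target_tokens must be non-empty")

    # Type level: each distinct target subword counts once.
    seen_types = sum(1 for tok in target_counts if tok in source_vocab)
    # Token level: each occurrence of a target subword counts.
    seen_tokens = sum(n for tok, n in target_counts.items() if tok in source_vocab)

    type_pct = 100.0 * seen_types / len(target_counts)
    token_pct = 100.0 * seen_tokens / sum(target_counts.values())
    return type_pct, token_pct


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Toy, made-up subword sequences (not data from the paper).
source = ["the", "cat", "##s", "in"]
target = ["the", "cat", "the", "gato"]
type_pct, token_pct = observed_subword_pct(source, target)
```

In practice one would compute one such percentage per target language and correlate the resulting list with per-language transfer scores via `pearson_r`; a library routine such as `scipy.stats.pearsonr` would additionally report the $p$-value.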