MuRIL: Multilingual Representations for Indian Languages

Simran Khanuja, Diksha Bansal, Sarvesh Mehtani, Savya Khosla, Atreyee Dey, Balaji Gopalan, Dilip Kumar Margam, Pooja Aggarwal, Rajiv Teja Nagipogu, Shachi Dave, Shruti Gupta, Subhash Chandra Bose Gali, Vish Subramanian, Partha Talukdar

TL;DR

MuRIL introduces a multilingual language model tailored to Indian languages by combining masked language modeling on monolingual IN data with translation language modeling on translated and transliterated parallel data. It balances language representation through upsampling and uses a vocabulary focused on IN languages, achieving significant improvements over mBERT on zero-shot XTREME-IN benchmarks, including transliterated test sets. The approach effectively addresses transliteration and code-mixing prevalent in IN language data and demonstrates strong cross-lingual transfer. The authors provide tooling and pretrained models (TFHub and HuggingFace) to facilitate practical adoption in Indian-language NLP applications.
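The upsampling mentioned above can be illustrated with exponentially smoothed sampling, a common way to balance languages in multilingual pre-training. This is a minimal sketch, not the authors' pipeline: the function name, the token counts, and the exponent value `alpha=0.3` are assumptions chosen for illustration.

```python
def smoothed_sampling_probs(token_counts, alpha=0.3):
    """Return per-language sampling probabilities p_i proportional to q_i ** alpha,
    where q_i is the language's share of all tokens.

    alpha < 1 flattens the distribution, so low-resource languages are
    sampled more often than their raw share would suggest.
    alpha=0.3 is an illustrative assumption.
    """
    total = sum(token_counts.values())
    weights = {lang: (n / total) ** alpha for lang, n in token_counts.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Hypothetical token counts, for illustration only.
counts = {"en": 1_000_000, "hi": 100_000, "as": 1_000}
probs = smoothed_sampling_probs(counts)

for lang in counts:
    raw_share = counts[lang] / sum(counts.values())
    print(f"{lang}: raw share {raw_share:.4f} -> sampling prob {probs[lang]:.4f}")
```

With an exponent below 1, the ordering of languages by size is preserved, but the smallest language receives a larger sampling probability than its raw token share.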

Abstract

India is a multilingual society with 1369 rationalized languages and dialects spoken across the country (INDIA, 2011). Of these, the 22 scheduled languages have a total of 1.17 billion speakers, and 121 languages have more than 10,000 speakers (INDIA, 2011). India also has the second-largest (and ever-growing) digital footprint (Statista, 2020). Despite this, today's state-of-the-art multilingual systems perform suboptimally on Indian (IN) languages. This can be explained by the fact that multilingual language models (LMs) are often trained on 100+ languages together, leading to a small representation of IN languages in their vocabulary and training data. Multilingual LMs are substantially less effective in resource-lean scenarios (Wu and Dredze, 2020; Lauscher et al., 2020), as limited data is insufficient to capture the various nuances of a language. One also commonly observes IN language text transliterated to Latin script or code-mixed with English, especially in informal settings such as social media platforms (Rijhwani et al., 2017). This phenomenon is not adequately handled by current state-of-the-art multilingual LMs. To address these gaps, we propose MuRIL, a multilingual LM specifically built for IN languages. MuRIL is trained exclusively on large IN text corpora. We explicitly augment monolingual text corpora with both translated and transliterated document pairs, which serve as supervised cross-lingual signals in training. MuRIL significantly outperforms multilingual BERT (mBERT) on all tasks in the challenging cross-lingual XTREME benchmark (Hu et al., 2020). We also present results on transliterated (native to Latin script) test sets of the chosen datasets and demonstrate the efficacy of MuRIL in handling transliterated data.

Paper Structure

This paper contains 7 sections, 1 equation, 8 figures, 14 tables.

Figures (8)

  • Figure 1: mBERT's (zero-shot) performance on Named Entity Recognition (NER). We observe significant differences between the performance on the English test set and other IN languages. This pattern is representative of current state-of-the-art multilingual models for Indian (IN) languages.
  • Figure 2: Upsampled Token Distribution. We upsample monolingual Wikipedia corpora as described in Section \ref{model_data}, to enhance low-resource language representation in the pre-training data.
  • Figure 3: IN language words tokenized using mBERT (blue) and MuRIL (red).
  • Figure 4: Percentage of WordPieces per script in the mBERT and MuRIL vocabularies. A WordPiece belongs to a script category if all of its characters fall into that category or are digits.
  • Figure 5: Fertility Ratio for IN languages using mBERT and MuRIL tokenizers. Here, trans subsumes all IN languages transliterated from their native script to Latin.
  • ...and 3 more figures
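
The two tokenizer statistics named in the captions of Figures 4 and 5 can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' code: fertility is taken here in its usual sense (average number of subword pieces per word), the script of a character is approximated by the first word of its Unicode name, and `toy_tokenize` is a hypothetical stand-in for a real WordPiece tokenizer.

```python
import unicodedata


def toy_tokenize(word, max_len=3):
    """Hypothetical stand-in for a WordPiece tokenizer: fixed-size chunks."""
    return [word[i:i + max_len] for i in range(0, len(word), max_len)]


def fertility(words, tokenize):
    """Average number of subword pieces per word (assumed definition)."""
    return sum(len(tokenize(w)) for w in words) / len(words)


def piece_script(piece):
    """Assign a piece to a script following the Figure 4 rule:
    every character must belong to that script or be a digit.

    The script is approximated by the first word of the Unicode
    character name (e.g. 'DEVANAGARI', 'LATIN'); this is an assumption.
    """
    if piece.startswith("##"):  # strip the WordPiece continuation marker
        piece = piece[2:]
    scripts = {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in piece
        if not ch.isdigit()
    }
    if not scripts:
        return "DIGITS"
    return scripts.pop() if len(scripts) == 1 else "MIXED"


words = ["namaste", "bharat"]  # illustrative transliterated words
print(fertility(words, toy_tokenize))
print(piece_script("##ing"), piece_script("नमस्ते"), piece_script("aन"))
```

A lower fertility means the tokenizer splits words into fewer pieces; to compare two real tokenizers, their tokenization functions would be passed in place of `toy_tokenize`.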