ConLID: Supervised Contrastive Learning for Low-Resource Language Identification

Negar Foroutan, Jakhongir Saydaliev, Ye Eun Kim, Antoine Bosselut

TL;DR

This work proposes a novel supervised contrastive learning approach to learn domain-invariant representations for low-resource languages. The approach improves LID performance on out-of-domain data for low-resource languages by 3.2 percentage points while maintaining performance on high-resource languages.

Abstract

Language identification (LID) is a critical step in curating multilingual LLM pretraining corpora from web crawls. While many studies on LID model training focus on collecting diverse training data to improve performance, low-resource languages -- often limited to single-domain data, such as the Bible -- continue to perform poorly. To resolve these imbalance and bias issues, we propose a novel supervised contrastive learning (SCL) approach to learn domain-invariant representations for low-resource languages. We show that our approach improves LID performance on out-of-domain data for low-resource languages by 3.2 percentage points, while maintaining its performance for the high-resource languages.

Paper Structure

This paper contains 44 sections, 4 equations, 4 figures, 21 tables, 1 algorithm.

Figures (4)

  • Figure 1: Overview of training ConLID. Each sentence is first processed by an encoder that generates a sentence representation. This representation is then passed to two components: (a) a Supervised Contrastive Learning (SCL) module, and (b) a feed-forward neural network that serves as the classification head. The final loss used for training is a combination of the classification loss and the SCL loss.
  • Figure 2: Performance of ConLID-S on the UDHR dataset based on training data domains. The more domains included in the training data, the higher the model's performance for a given language. The size of the training data is not always positively correlated with the model's performance, particularly when a language is represented in only a few domains.
  • Figure 3: UDHR misclassification pairs (languages with F1 $< 0.8$) using ConLID-S. Most misclassifications for a language occur among fewer than 5 languages.
  • Figure 4: Ratio of text recovered by applying 5 levels of filtering to the FineWeb-2 UND_XXX dataset. Predictions are made by the ConLID-S model.
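
The training setup described in the Figure 1 caption (an encoder output fed to both a classification head and an SCL module, with the two losses combined) can be illustrated with a minimal sketch. This is not the authors' implementation: the standard supervised contrastive loss form, the weighted-sum combination, and the names `supcon_loss`, `combined_loss`, `scl_weight`, and `temperature` are all assumptions made for illustration, and the sketch does not model how domains are used when choosing positives and negatives.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss, averaged over anchors with >= 1 positive.

    Assumption: the common formulation where, for each anchor, same-label
    samples are positives and all other samples in the batch form the
    softmax denominator. The paper's exact variant may differ.
    """
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Temperature-scaled cosine similarity to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp over the denominator.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def cross_entropy(logits, label):
    """Cross-entropy of one sample, computed from raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def combined_loss(embeddings, logits, labels, scl_weight=1.0, temperature=0.1):
    """Classification loss plus weighted SCL loss.

    Assumption: a simple weighted sum; the source only says the two
    losses are combined.
    """
    ce = sum(cross_entropy(l, y) for l, y in zip(logits, labels)) / len(labels)
    return ce + scl_weight * supcon_loss(embeddings, labels, temperature)
```

In this sketch, embeddings that cluster by language label yield a lower SCL term than embeddings where same-language samples are far apart, which is the pressure that pulls representations of one language together regardless of surface differences between samples.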