ConLID: Supervised Contrastive Learning for Low-Resource Language Identification

Negar Foroutan; Jakhongir Saydaliev; Ye Eun Kim; Antoine Bosselut

TL;DR

This work proposes a novel supervised contrastive learning approach that learns domain-invariant representations for low-resource languages. The approach improves language identification (LID) performance on out-of-domain data for low-resource languages by 3.2 percentage points while maintaining performance on high-resource languages.

Abstract

Language identification (LID) is a critical step in curating multilingual LLM pretraining corpora from web crawls. While many studies on LID model training focus on collecting diverse training data to improve performance, LID models continue to perform poorly on low-resource languages, whose training data is often limited to a single domain such as the Bible. To address this data imbalance and domain bias, we propose a novel supervised contrastive learning (SCL) approach to learn domain-invariant representations for low-resource languages. We show that our approach improves LID performance on out-of-domain data for low-resource languages by 3.2 percentage points while maintaining performance on high-resource languages.
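
To make the idea of a supervised contrastive objective concrete, the following is a minimal NumPy sketch of the generic supervised contrastive loss, in which samples that share a language label are pulled together and all other samples in the batch are pushed apart. The abstract does not specify the exact loss, the temperature, or how positives and negatives are selected in ConLID, so the function name `supervised_contrastive_loss`, the default `temperature=0.1`, the toy embeddings, and the language labels below are all assumptions made for illustration only.

```python
import numpy as np


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss over one batch (illustrative sketch).

    Assumptions (not taken from the paper): embeddings are non-zero vectors,
    positives are all other batch items with the same label, and anchors
    without any positive are skipped.
    """
    z = np.asarray(embeddings, dtype=np.float64)
    labels = np.asarray(labels)
    n = z.shape[0]

    self_mask = np.eye(n, dtype=bool)
    # Positives: same label as the anchor, excluding the anchor itself.
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = pos_mask.sum(axis=1)
    valid = pos_count > 0
    if not valid.any():
        return 0.0  # no anchor has a positive in this batch

    # Cosine similarity scaled by the temperature.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature
    # Exclude each anchor's similarity to itself from the softmax.
    sim = np.where(self_mask, -np.inf, sim)

    # Numerically stable row-wise log-softmax.
    row_max = sim.max(axis=1, keepdims=True)
    shifted = sim - row_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Average log-probability of the positives for each valid anchor.
    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    loss_per_anchor = -pos_log_prob[valid] / pos_count[valid]
    return float(loss_per_anchor.mean())


# Toy batch: four 2-d embeddings forming two tight clusters (hypothetical data).
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

# Labels that agree with the clusters vs. labels that cut across them.
loss_aligned = supervised_contrastive_loss(toy_embeddings, ["fra", "fra", "swh", "swh"])
loss_mixed = supervised_contrastive_loss(toy_embeddings, ["fra", "swh", "fra", "swh"])
print(loss_aligned < loss_mixed)
```

In this toy batch the loss is lower when the labels agree with the embedding clusters than when they cut across them, which is the behavior such an objective rewards. For domain invariance, one plausible setup (again an assumption, not something the abstract states) is to build batches in which same-language samples come from different domains, so that the representation cannot rely on domain-specific cues.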
