MaskLID: Code-Switching Language Identification through Iterative Masking

Amir Hossein Kargaran, François Yvon, Hinrich Schütze

TL;DR

MaskLID is a training-free method for detecting code-switching (CS) in text. It masks the features that a sentence-level LID associates with the dominant language, then re-classifies the masked text to reveal minority languages. It uses FastText-based LID models (GlotLID/OpenLID) as backbones and frames CS detection as a set-prediction problem, solved by an iterative masking algorithm whose parameters $\alpha$, $\beta$, $\tau$, and $\lambda$ control strong/weak associations, minimum length, and iteration count. Empirically, MaskLID substantially improves CS detection on Turkish-English, Basque-Spanish, Hindi-English, and Nepali-English datasets while preserving strong performance on monolingual instances; depending on the language pair, the number of detected CS instances rises from single digits to tens or more. The approach enables scalable CS data mining for downstream CS-aware NLP applications and can be extended to subword features and other LID models. Future work includes interpretable feature mappings (e.g., via LIME) and web-scale deployment.

Abstract

We present MaskLID, a simple yet effective code-switching (CS) language identification (LID) method. MaskLID does not require any training and is designed to complement current high-performance sentence-level LIDs. Sentence-level LIDs are classifiers trained on monolingual texts to provide single labels, typically using a softmax layer to turn scores into probabilities. However, when a sentence is composed in both an L1 and an L2 language, the LID classifier often returns only the dominant label L1. To address this limitation, MaskLID masks the text features associated with L1, allowing the LID to classify the text as L2 in the next round. The method uses the LID itself to identify the features that require masking and does not rely on any external resource. In this work, we explore the use of MaskLID with two open-source LIDs (GlotLID and OpenLID), both of which are based on the FastText architecture. Code and demo are available at https://github.com/cisnlp/MaskLID.
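
The mask-and-reclassify loop described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: it replaces the FastText backbone with a hypothetical word-score table (`TOY_SCORES`), and the names `predict`, `mask_lid`, `max_iters`, and `min_words` are assumptions made for illustration. `max_iters` and `min_words` only loosely mirror the iteration-count and minimum-length parameters; the strong/weak association thresholds are not modeled.

```python
from collections import defaultdict

# Hypothetical per-word scores for a toy linear LID. A real FastText-based
# LID would derive such scores from its learned parameters instead.
TOY_SCORES = {
    "bugün": {"tur": 2.0},
    "hava": {"tur": 1.5},
    "çok": {"tur": 1.8},
    "güzel": {"tur": 2.0},
    "very": {"eng": 1.9},
    "nice": {"eng": 1.7},
    "weather": {"eng": 2.1},
}


def predict(words):
    """Return (label, score) for the highest-scoring language, or None."""
    totals = defaultdict(float)
    for word in words:
        for lang, score in TOY_SCORES.get(word, {}).items():
            totals[lang] += score
    if not totals:
        return None
    return max(totals.items(), key=lambda item: item[1])


def mask_lid(text, max_iters=2, min_words=2):
    """Predict the dominant label, mask its words, and classify again."""
    words = text.lower().split()
    found = []
    for _ in range(max_iters):
        top = predict(words)
        if top is None:
            break
        label = top[0]
        # Words that carry a score for the current label count as its features.
        owned = {w for w in words if label in TOY_SCORES.get(w, {})}
        # Skip labels supported by too little text (toy minimum-length check).
        if len(owned) < min_words:
            break
        found.append(label)
        # Mask: drop the words tied to the label that was just predicted.
        words = [w for w in words if w not in owned]
    return found


print(mask_lid("bugün hava çok güzel very nice weather"))
```

On the mixed Turkish-English input, the first round returns the dominant label, and the second round, run on the remaining words, surfaces the other language. The minimum-length check keeps a single stray word from being reported as a separate language.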

Paper Structure (21 sections, 2 equations, 1 table)