PolyLingua: Margin-based Inter-class Transformer for Robust Cross-domain Language Detection

Ali Lotfi Rezaabad, Bikram Khanal, Shashwat Chaurasia, Lu Zeng, Dezhi Hong, Hossein Bashashati, Thomas Butler, Megan Ganji

TL;DR

Language identification is a critical bottleneck in multilingual AI systems, especially under code-switching and short utterances. The authors propose PolyLingua, a lightweight, multi-task Transformer that jointly performs in-domain detection and fine-grained language classification using a shared encoder and a two-level margin-based contrastive objective with adaptive inter-class margins. Empirical results on Amazon Massive and a synthetic Song dataset show that PolyLingua achieves near-LLM accuracy with an order of magnitude fewer parameters and low latency, outperforming baselines and producing fewer misclassifications among confusable languages. This approach offers a practical, scalable solution for robust, cross-domain language detection in latency-constrained environments.

Abstract

Language identification is a crucial first step in multilingual systems such as chatbots and virtual assistants, enabling linguistically and culturally accurate user experiences. Errors at this stage can cascade into downstream failures, setting a high bar for accuracy. Yet existing language identification tools struggle with key cases -- such as music requests where the song title and user language differ. Open-source tools like LangDetect and FastText are fast but less accurate, while large language models, though effective, are often too costly for low-latency or low-resource settings. We introduce PolyLingua, a lightweight Transformer-based model for in-domain language detection and fine-grained language classification. It employs a two-level contrastive learning framework combining instance-level separation and class-level alignment with adaptive margins, yielding compact and well-separated embeddings even for closely related languages. Evaluated on two challenging datasets -- Amazon Massive (multilingual digital assistant utterances) and a Song dataset (music requests with frequent code-switching) -- PolyLingua achieves 99.25% F1 and 98.15% F1, respectively, surpassing Sonnet 3.5 while using 10x fewer parameters, making it ideal for compute- and latency-constrained environments.
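
As a rough illustration of what a margin-based contrastive objective with class-dependent margins can look like, the sketch below computes a triplet-style hinge loss over cosine distances. It is not the paper's loss: the margin values, the `PAIR_MARGINS` table, the choice of cosine distance, and all function names are assumptions made for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical margins (assumption): a confusable language pair is
# given a larger margin than the default.
DEFAULT_MARGIN = 0.2
PAIR_MARGINS = {frozenset({"es", "pt"}): 0.5}

def margin_for(lang_a, lang_b):
    """Look up the margin for an unordered pair of language labels."""
    return PAIR_MARGINS.get(frozenset({lang_a, lang_b}), DEFAULT_MARGIN)

def triplet_margin_loss(u_i, u_p, u_n, lang_i, lang_n):
    """Hinge loss that pulls the anchor u_i toward the positive u_p and
    pushes it at least `margin` further away from the negative u_n."""
    d_pos = 1.0 - cosine_similarity(u_i, u_p)
    d_neg = 1.0 - cosine_similarity(u_i, u_n)
    return max(0.0, d_pos - d_neg + margin_for(lang_i, lang_n))

# A well-separated negative incurs no loss; a negative that coincides
# with the anchor is penalized by the full pair margin.
separated = triplet_margin_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], "es", "pt")
collapsed = triplet_margin_loss([1.0, 0.0], [1.0, 0.0], [1.0, 0.0], "es", "pt")
```

With a larger margin for a confusable pair, the same geometric overlap costs more, which is one plausible way to encourage wider separation between closely related languages.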

Paper Structure

This paper contains 13 sections, 5 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Given $u_i$ and a positive $u_p$ from the same class, and a negative $u_n$ from a different language, PolyLingua aligns $u_i$ with $u_p$ while pushing it away from $u_n$, using a shared encoder and class-aware margin-based contrastive losses.
  • Figure 2: UMAP projections of utterance embeddings from Amazon Massive (top row) and the Song dataset (bottom row). Each point represents an utterance and is colored by its true language label. On Amazon Massive, PolyLingua forms more compact and well-separated clusters than XLM-LID and Baseline+SupCon, especially for similar languages such as French, Portuguese, and Spanish, due to its margin-based inter-class separation. On the Song dataset, which includes noisy utterances with diverse and multilingual artist and song entities, PolyLingua again shows clearer clustering, demonstrating robustness to intra-class variation and entity-induced noise.
  • Figure 3: Left: Confusion matrix for the proposed PolyLingua model on 10 in-domain languages, demonstrating low misclassification. Center: Difference in the normalized confusion matrices between PolyLingua and the Baseline+SupCon model. Right: Difference between PolyLingua and the standard Baseline. Blue cells on the diagonal indicate improvements in true positive rates by PolyLingua, while red cells off the diagonal represent reductions in misclassification and confusion. The arrow marks the performance improvement in the difference confusion matrices.
  • Figure 4: Comparison of cosine similarity distributions between PolyLingua and baseline models on the Amazon Massive dataset. (a) and (b) show positive and negative pair distributions for PolyLingua against XLM-LID and Baseline+SupCon, respectively. (c) shows difference histograms, indicating better class separation in PolyLingua.
  • Figure 5: Distribution of language labels in the Amazon MASSIVE dataset. We consider ten languages as in-domain; all others are grouped under out_domain.
  • ...and 1 more figure
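
The positive and negative pair distributions described for Figure 4 can be gathered along the following lines. This is a minimal sketch under assumptions: the helper name `pair_similarities` and the toy embeddings are hypothetical, and the paper's actual sampling of pairs is not specified here.

```python
import math
from itertools import combinations

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def pair_similarities(embeddings, labels):
    """Split all pairwise cosine similarities into same-language
    (positive) and different-language (negative) groups."""
    positives, negatives = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        sim = cosine_similarity(embeddings[i], embeddings[j])
        if labels[i] == labels[j]:
            positives.append(sim)
        else:
            negatives.append(sim)
    return positives, negatives

# Toy data (hypothetical): two French utterances and one Spanish one.
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
labels = ["fr", "fr", "es"]
positives, negatives = pair_similarities(embeddings, labels)
```

Histograms of the two lists give the kind of comparison the caption describes: the further the positive distribution sits above the negative one, the better the class separation.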