Neither Here Nor There: Cross-Lingual Representation Dynamics of Code-Mixed Text in Multilingual Encoders

Debajyoti Mazumder, Divyansh Pathak, Prashant Kodali, Jasabanta Patro

Abstract

Multilingual encoder-based language models are widely adopted for code-mixed analysis tasks, yet we know surprisingly little about how they represent code-mixed inputs internally - or whether those representations meaningfully connect to the constituent languages being mixed. Using Hindi-English as a case study, we construct a unified trilingual corpus of parallel English, Hindi (Devanagari), and Romanized code-mixed sentences, and probe cross-lingual representation alignment across standard multilingual encoders and their code-mixed adapted variants via CKA, token-level saliency, and entropy-based uncertainty analysis. We find that while standard models align English and Hindi well, code-mixed inputs remain loosely connected to either language - and that continued pre-training on code-mixed data improves English-code-mixed alignment at the cost of English-Hindi alignment. Interpretability analyses further reveal a clear asymmetry: models process code-mixed text through an English-dominant semantic subspace, while native-script Hindi provides complementary signals that reduce representational uncertainty. Motivated by these findings, we introduce a trilingual post-training alignment objective that brings code-mixed representations closer to both constituent languages simultaneously, yielding more balanced cross-lingual alignment and downstream gains on sentiment analysis and hate speech detection - showing that grounding code-mixed representations in their constituent languages meaningfully helps cross-lingual understanding.
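The abstract names CKA as one of the alignment probes but does not say here which variant is used. As a minimal sketch, assuming the linear form of CKA and assuming each language's inputs have already been pooled into one vector per sentence, the similarity between two sets of parallel-sentence representations could be computed as follows. The function name `linear_cka` and the array shapes are illustrative choices, not taken from the paper.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two representation matrices.

    Both inputs have shape (n_sentences, dim); row i of `x` and row i of `y`
    are assumed to encode the same parallel sentence in two languages.
    The two hidden sizes may differ.
    """
    # Center every feature column so constant offsets do not affect the score.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # Squared Frobenius norm of the cross-covariance (unnormalized).
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    # Frobenius norms of the self-covariances act as the normalizer.
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))
```

Applied per layer to, say, English and code-mixed sentence vectors, a score near 1 would indicate closely aligned representation geometry and a score near 0 weak alignment. Linear CKA is invariant to rotations and uniform rescaling of either space, which is one reason it is commonly used to compare layers or models.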

Paper Structure (43 sections, 13 equations, 9 figures, 13 tables, 1 algorithm)

Figures (9)

  • Figure 1: Vanilla multilingual model versus code-mixed adapted multilingual model. Here, S represents similarity.
  • Figure 2: Directional cross-lingual alignment accuracy across model families. Solid bars show retrieval from language A$\rightarrow$B, hatched bars show B$\rightarrow$A.
  • Figure 3: t-SNE visualization of sentence representations using mBERT and XLM-R-family models.
  • Figure 4: Directional dot-product retrieval accuracy across encoder models for EN–HI, EN–CM, and HI–CM, averaged across all layers. Solid bars denote A$\rightarrow$B and hatched bars denote B$\rightarrow$A.
  • Figure 5: Layer-wise CKA alignment for mBERT and XLM-R family. Each subplot shows cross-lingual representation alignment for EN-CM, EN-HI, and HI-CM across layers.
  • ...and 4 more figures