Neither Here Nor There: Cross-Lingual Representation Dynamics of Code-Mixed Text in Multilingual Encoders

Debajyoti Mazumder; Divyansh Pathak; Prashant Kodali; Jasabanta Patro

Abstract

Multilingual encoder-based language models are widely adopted for code-mixed analysis tasks, yet little is known about how they represent code-mixed inputs internally, or whether those representations meaningfully connect to the constituent languages being mixed. Using Hindi-English as a case study, we construct a unified trilingual corpus of parallel English, Hindi (Devanagari), and Romanized code-mixed sentences, and probe cross-lingual representation alignment across standard multilingual encoders and their code-mixed adapted variants via centered kernel alignment (CKA), token-level saliency, and entropy-based uncertainty analysis. We find that while standard models align English and Hindi well, code-mixed inputs remain loosely connected to either language, and that continued pre-training on code-mixed data improves English-code-mixed alignment at the cost of English-Hindi alignment. Interpretability analyses further reveal a clear asymmetry: models process code-mixed text through an English-dominant semantic subspace, while native-script Hindi provides complementary signals that reduce representational uncertainty. Motivated by these findings, we introduce a trilingual post-training alignment objective that brings code-mixed representations closer to both constituent languages simultaneously, yielding more balanced cross-lingual alignment and downstream gains on sentiment analysis and hate speech detection. These results show that grounding code-mixed representations in their constituent languages meaningfully helps cross-lingual understanding.
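For readers unfamiliar with CKA, the following is a minimal sketch of the standard linear CKA similarity between two sets of sentence representations. It is an illustration only, not the authors' implementation: the abstract does not say which CKA variant or pooling strategy is used, so the linear kernel, the function names, and the toy inputs are all assumptions. It uses only the Python standard library; in practice an array library would normally be used.

```python
import math


def center_columns(rows):
    """Subtract the per-feature mean so every column has zero mean."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def cross_product(a, b):
    """Return a^T b for row-major matrices a (n x p) and b (n x q)."""
    return [
        [sum(x * y for x, y in zip(col_a, col_b)) for col_b in zip(*b)]
        for col_a in zip(*a)
    ]


def frobenius(matrix):
    """Frobenius norm of a row-major matrix."""
    return math.sqrt(sum(v * v for row in matrix for v in row))


def linear_cka(x, y):
    """Linear CKA between x (n x p) and y (n x q).

    Row i of x and row i of y are assumed to encode the same parallel
    sentence (for example, its English and code-mixed versions).
    Raises ZeroDivisionError if either input is constant across rows.
    """
    xc, yc = center_columns(x), center_columns(y)
    cross = frobenius(cross_product(yc, xc)) ** 2
    return cross / (
        frobenius(cross_product(xc, xc)) * frobenius(cross_product(yc, yc))
    )


# Hypothetical toy "sentence embeddings" for four parallel sentences.
english = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
# The same points rotated by 90 degrees: geometry is preserved.
rotated = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0], [0.0, 2.0]]

print(linear_cka(english, english))
print(linear_cka(english, rotated))
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of either representation, which is why it is a common choice for comparing layers or languages whose embedding axes are not directly comparable.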

Paper Structure (43 sections, 13 equations, 9 figures, 13 tables, 1 algorithm)

Introduction
Related Work
Dataset
Experiments
Models
Cross-Lingual Alignment
Observations
Interpretability Analysis
Our Proposed Trilingual Post-Training Alignment Stage
Cross-Lingual Alignment Loss
Cross-Lingual Alignment Evaluation
Observations
Downstream Task Validation
Conclusion
Additional Dataset Details
...and 28 more sections

Figures (9)

Figure 1: Vanilla multilingual model versus code-mixed adapted multilingual model. Here, S represents similarity.
Figure 2: Directional cross-lingual alignment accuracy across model families. Solid bars show retrieval from language A$\rightarrow$B, hatched bars show B$\rightarrow$A.
Figure 3: t-SNE visualization of sentence representations using mBERT and XLM-R-family models.
Figure 4: Directional dot-product retrieval accuracy across encoder models for EN–HI, EN–CM, and HI–CM, averaged across all layers. Solid bars denote A$\rightarrow$B and hatched bars denote B$\rightarrow$A.
Figure 5: Layer-wise CKA alignment for mBERT and XLM-R family. Each subplot shows cross-lingual representation alignment for EN-CM, EN-HI, and HI-CM across layers.
...and 4 more figures
