Unraveling Code-Mixing Patterns in Migration Discourse: Automated Detection and Analysis of Online Conversations on Reddit

Fedor Vitiugin; Sunok Lee; Henna Paakki; Anastasiia Chizhikova; Nitin Sawhney

TL;DR

The paper addresses the challenge of understanding code-mixed migration discourse on Reddit to enhance inclusive digital public services. It introduces ELMICT, an ensemble method that fuses multiple tokenizers and soft labels from a fine-tuned large language model to detect code-mixed text, evaluated in monolingual English–Finnish and cross-lingual zero-shot settings, and paired with BERTopic-based topic modeling to reveal where code-mixing concentrates. ELMICT achieves F1 > 0.95 for English–Finnish detection and maintains F1 > 0.70 in cross-lingual scenarios, while topic analysis highlights high code-mixing in housing, employment, and public utilities topics among migrants. The approach supports building trust in multilingual public services and informs the design of conversational systems for migrant communities, though future work should broaden language coverage, address named-entity challenges, and extend reproducibility.

Abstract

The surge in global migration underscores the need to integrate migrants into host communities, which in turn requires inclusive and trustworthy public services. Despite the Nordic countries' robust public sector infrastructure, recent immigrants often encounter barriers to accessing these services, exacerbating social disparities and eroding trust. Addressing digital inequalities and linguistic diversity is paramount in this endeavor. This paper explores the use of code-mixing, a communication strategy prevalent among multilingual speakers, in migration-related discourse on social media platforms such as Reddit. We present Ensemble Learning for Multilingual Identification of Code-mixed Texts (ELMICT), a novel approach designed to automatically detect code-mixed messages in migration-related discussions. By using ensemble learning to combine the outputs of multiple tokenizers with pre-trained language models, ELMICT achieves high performance in identifying code-mixing (F1 above 0.95) and remains effective across languages and contexts, including cross-lingual zero-shot conditions (average F1 above 0.70). Moreover, ELMICT makes it possible to compare the prevalence of code-mixing in migration-related threads with that in other thematic categories on Reddit, shedding light on the topics of concern to migrant communities. Our findings reveal insights into the communicative strategies employed by migrants on social media platforms, with implications for the development of inclusive digital public services and conversational systems. By addressing the research questions posed in this study, we contribute to the understanding of linguistic diversity in migration discourse and pave the way for more effective tools for building trust in multicultural societies.
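To make the idea of combining several tokenizers' outputs with a model's soft label more concrete, the following is a minimal sketch and not the ELMICT implementation. It assumes that a single-language tokenizer splits out-of-language words into more pieces, so tokens-per-word under each tokenizer can serve as a feature. The toy vocabularies, the character fallback, the names `make_tokenizer`, `fertility`, and `fuse_features`, and the soft-label value are all hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Set

# Hypothetical toy vocabularies standing in for single-language tokenizers.
EN_VOCAB: Set[str] = {"i", "need", "a", "new", "apartment", "in", "the", "city"}
FI_VOCAB: Set[str] = {"minä", "tarvitsen", "uuden", "asunnon", "kaupungissa"}

Tokenizer = Callable[[str], List[str]]


def make_tokenizer(vocab: Set[str]) -> Tokenizer:
    """Build a toy tokenizer: known words stay whole, unknown words
    are split into single characters (a crude stand-in for subwords)."""

    def tokenize(text: str) -> List[str]:
        tokens: List[str] = []
        for word in text.lower().split():
            if word in vocab:
                tokens.append(word)
            else:
                tokens.extend(word)  # one token per character
        return tokens

    return tokenize


def fertility(tokenize: Tokenizer, text: str) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    return len(tokenize(text)) / len(words)


def fuse_features(text: str, tokenizers: List[Tokenizer], soft_label: float) -> List[float]:
    """Concatenate per-tokenizer fertility with a soft label, i.e. a
    probability that an upstream model assigns to the code-mixed class.
    The soft label is passed in as a plain float here (an assumption)."""
    return [fertility(tok, text) for tok in tokenizers] + [soft_label]


en_tok = make_tokenizer(EN_VOCAB)
fi_tok = make_tokenizer(FI_VOCAB)

mixed = "i need a new asunnon"        # English with one Finnish word
english = "i need a new apartment"    # English only

mixed_features = fuse_features(mixed, [en_tok, fi_tok], soft_label=0.9)
english_features = fuse_features(english, [en_tok, fi_tok], soft_label=0.1)
```

In this toy setting the English-only message is cheap for the English tokenizer and expensive for the Finnish one, while the mixed message is moderately expensive for both. A downstream classifier could be trained on such fused vectors; which classifier to use is left open here.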

Paper Structure (19 sections, 3 figures, 5 tables)

This paper contains 19 sections, 3 figures, 5 tables.

Introduction
Related Work
Code-Mixing and Code-Switching
The Role of Code-Mixing in Migrant Communication
Code-Mixed Data Processing
Method
Text Classification
Topic Modeling
Model Implementation
Experiment Setup
Data collection and annotation
Schemes
Result Analysis and Discussion
English-Finnish Code-Mixing Detection
Cross-lingual Code-Mixing Detection
...and 4 more sections

Figures (3)

Figure 1: Example of differences in single-language pre-trained model tokenizer outputs.
Figure 2: ELMICT model architecture.
Figure 3: Proportion of code-mixing messages per topic per flair in English-Finnish dataset.
