
From Human Judgements to Predictive Models: Unravelling Acceptability in Code-Mixed Sentences

Prashant Kodali, Anmol Goel, Likhith Asapu, Vamshi Krishna Bonagiri, Anirudh Govil, Monojit Choudhury, Ponnurangam Kumaraguru, Manish Shrivastava

TL;DR

This work addresses how to quantify and predict the acceptability of code-mixed sentences, specifically English-Hindi. It introduces the Cline dataset, combining social-network and synthetically generated text annotated for acceptability, enabling rigorous evaluation beyond traditional code-mixing metrics. The study shows that common metrics poorly predict acceptability, while fine-tuned multilingual LLMs—especially decoder-only models like Llama 3.2-3B—achieve superior predictive performance and can transfer to related language pairs (e.g., En-Hi to En-Te). These findings have practical implications for data curation and generation of code-mixed content, and for building robust multilingual NLP systems that better reflect human judgments of naturalness.

Abstract

Current computational approaches for analysing or generating code-mixed sentences do not explicitly model the "naturalness" or "acceptability" of code-mixed sentences, but rely on training corpora to reflect the distribution of acceptable code-mixed sentences. Modelling human judgement of the acceptability of code-mixed text can help in distinguishing natural code-mixed text and enable quality-controlled generation of code-mixed text. To this end, we construct Cline, a dataset containing human acceptability judgements for English-Hindi (en-hi) code-mixed text. Cline is the largest of its kind with 16,642 sentences, drawn from two sources: synthetically generated code-mixed text and samples collected from online social media. Our analysis establishes that popular code-mixing metrics such as CMI, Number of Switch Points, and Burstiness, which are used to filter, curate, and compare code-mixed corpora, have low correlation with human acceptability judgements, underlining the necessity of our dataset. Experiments using Cline demonstrate that simple Multilayer Perceptron (MLP) models trained solely on code-mixing metrics as features are outperformed by fine-tuned pre-trained Multilingual Large Language Models (MLLMs). Specifically, among Encoder models, XLM-Roberta and Bernice outperform IndicBERT across different configurations. Among Encoder-Decoder models, mBART performs better than mT5; however, Encoder-Decoder models are not able to outperform Encoder-only models. Decoder-only models perform the best of all MLLMs, with Llama 3.2-3B outperforming the similarly sized Qwen and Phi models. Comparison with the zero-shot and few-shot capabilities of ChatGPT shows that MLLMs fine-tuned on larger data outperform ChatGPT, leaving scope for improvement on code-mixed tasks. Zero-shot transfer of acceptability judgements from En-Hi to En-Te is better than random baselines.
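To make the metrics named in the abstract concrete, the following is a minimal sketch of how CMI and the number of switch points might be computed from token-level language tags. It assumes the commonly used per-sentence CMI formulation, 100 × (1 − max_lang_tokens / language_tagged_tokens), and hypothetical tag names (`"en"`, `"hi"`, `"other"`); the paper's own implementation and tag set may differ.

```python
from collections import Counter

# Hypothetical tag for language-independent tokens (punctuation, names, etc.).
NEUTRAL_TAGS = ("other",)


def cmi(tags, neutral=NEUTRAL_TAGS):
    """Code-Mixing Index of one sentence, on a 0-100 scale.

    Assumed formulation: 100 * (1 - dominant / n_lang), where n_lang is the
    number of language-tagged tokens and dominant is the token count of the
    most frequent language. Returns 0.0 if no token carries a language tag.
    """
    lang_tags = [t for t in tags if t not in neutral]
    if not lang_tags:
        return 0.0
    dominant = max(Counter(lang_tags).values())
    return 100.0 * (1.0 - dominant / len(lang_tags))


def switch_points(tags, neutral=NEUTRAL_TAGS):
    """Count positions where the language changes between consecutive
    language-tagged tokens (neutral tokens are skipped)."""
    lang_tags = [t for t in tags if t not in neutral]
    return sum(1 for a, b in zip(lang_tags, lang_tags[1:]) if a != b)


if __name__ == "__main__":
    # Illustrative tag sequence, not taken from the dataset.
    example = ["hi", "hi", "en", "en", "hi", "other"]
    print(cmi(example), switch_points(example))
```

Both functions look only at language tags, not at which words are switched or where in the syntax the switch occurs, which is one plausible reason such metrics correlate weakly with human acceptability judgements.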

Paper Structure (16 sections, 3 equations, 9 figures, 10 tables)

Figures (9)

  • Figure 1: Overview of our study, which involves three main components: a) creating code-mixed samples; b) collecting human judgments on acceptability; c) evaluating code-mix metrics and training predictive models. For perturbations to OSN text, L1 and L2 indicate the language ID of the tokens in a code-mixed sentence (e.g., L1 could be Hindi, and L2 could be English).
  • Figure 2: Illustrative example of how the Samanantar corpus and the GCM toolkit are used to create synthetic code-mixed sentences, which are then annotated by humans. We refer readers to Rizvi et al. (2021) and Pratapa et al. (2018) for a more detailed description of the EC and ML theories and their computational implementation for generating synthetic code-mixed sentences.
  • Figure 3: Screenshot of the annotation tool UI provided to the annotators
  • Figure 4: Distribution of Code-Mix Index (CMI), Number of switch points, and Burstiness of code-mix sentences in our dataset.
  • Figure 5: Distribution of $SyMCoM$ for OSN and GCM samples in our dataset. The $SyMCoM$ score represents the syntactic variety in a code-mix sentence. The distribution of $SyMCoM$ for OSN posts is usually skewed towards -1 (i.e., only one language is contributing tokens for that PoS tag). $SyMCoM$ for GCM is spread out for most PoS tags, indicating higher diversity in the switching patterns of PoS tags.
  • ...and 4 more figures
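
The per-PoS-tag score described in the Figure 5 caption can be sketched as follows. This assumes the usual SyMCoM definition for a syntactic unit, (count_L1 − count_L2) / (count_L1 + count_L2), which ranges from -1 to 1 and reaches either extreme when a single language contributes all tokens for that tag. The function name, the input format, and the PoS tags in the example are hypothetical, and the paper's implementation may differ.

```python
from collections import defaultdict


def symcom_per_pos(tokens):
    """Compute a SyMCoM-style score for each PoS tag in one sentence.

    tokens: iterable of (pos_tag, lang) pairs, where lang is "L1" or "L2".
    Tokens with any other language label are ignored.
    Returns a dict mapping pos_tag -> score in [-1, 1].
    """
    counts = defaultdict(lambda: {"L1": 0, "L2": 0})
    for pos, lang in tokens:
        if lang in ("L1", "L2"):
            counts[pos][lang] += 1
    # Every tag in counts has at least one token, so the denominator is > 0.
    return {
        pos: (c["L1"] - c["L2"]) / (c["L1"] + c["L2"])
        for pos, c in counts.items()
    }


if __name__ == "__main__":
    # Illustrative input, not taken from the dataset.
    sentence = [
        ("NOUN", "L2"), ("NOUN", "L2"),  # all nouns from L2
        ("VERB", "L1"),                  # all verbs from L1
        ("ADP", "L1"), ("ADP", "L2"),    # adpositions split evenly
    ]
    print(symcom_per_pos(sentence))
```

Under this sign convention, which language maps to -1 depends on which one is labelled L1; a score near 0 means both languages contribute tokens for that tag.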