Conditioning LLMs to Generate Code-Switched Text

Maite Heredia, Gorka Labaka, Jeremy Barnes, Aitor Soroa

TL;DR

This work tackles code-switching (CS) generation by using back-translation to create a pseudo-parallel English-to-CS corpus (EN-CS) from the LINCE English-Spanish CS data, which enables fine-tuning autoregressive LLMs to convert monolingual English into natural CS. In experiments comparing fine-tuned models with few-shot baselines, the authors show that fine-tuned LLMs are preferred more often by human annotators and make fewer CS-related errors than the baselines, outperforming some proprietary models in CS generation. The evaluations reveal a persistent mismatch between common automatic metrics (BLEU, BERTScore, chrF) and human judgments, while GPT-based judging correlates only moderately, underscoring the need for CS-aware evaluation methods. The work contributes an openly released EN-CS dataset and a CS generation pipeline, with implications for multilingual NLP applications and future cross-lingual CS research, though it acknowledges limitations such as in-domain evaluation and the dependence on initial CS data.

Abstract

Code-switching (CS) is still a critical challenge in Natural Language Processing (NLP). Current Large Language Models (LLMs) struggle to interpret and generate code-switched text, primarily due to the scarcity of large-scale CS datasets for training. This paper presents a novel methodology to generate CS data using LLMs, and tests it on the English-Spanish language pair. We propose back-translating natural CS sentences into monolingual English, and using the resulting parallel corpus to fine-tune LLMs to turn monolingual sentences into CS. Unlike previous approaches to CS generation, our methodology uses natural CS data as a starting point, allowing models to learn its natural distribution beyond grammatical patterns. We thoroughly analyse the models' performance through a study on human preferences, a qualitative error analysis and an evaluation with popular automatic metrics. Results show that our methodology generates fluent code-switched text, expanding research opportunities in CS communication, and that traditional metrics do not correlate with human judgement when assessing the quality of the generated CS data. We release our code and generated dataset under a CC-BY-NC-SA license.
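
The data-construction step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' released code: the function names, the prompt wording, and the toy sentences are all hypothetical, and the back-translation model is replaced by a plain callable so the sketch stays self-contained.

```python
from typing import Callable, Dict, List

# Hypothetical prompt wording; the source does not specify the template used.
PROMPT_TEMPLATE = (
    "Rewrite the following English sentence as English-Spanish "
    "code-switched text:\n{source}\n"
)


def build_en_cs_pairs(
    cs_sentences: List[str],
    back_translate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Pair each natural CS sentence with its monolingual English back-translation."""
    pairs = []
    for cs in cs_sentences:
        cs = cs.strip()
        en = (back_translate(cs) or "").strip()
        # Skip empty outputs and sentences the translator left unchanged.
        if not en or en == cs:
            continue
        # English is the model input; the natural CS sentence is the target.
        pairs.append({"source": en, "target": cs})
    return pairs


def to_training_example(pair: Dict[str, str]) -> Dict[str, str]:
    """Format one pair as a prompt/completion record for fine-tuning."""
    return {
        "prompt": PROMPT_TEMPLATE.format(source=pair["source"]),
        "completion": pair["target"],
    }


# Toy stand-in for the back-translation step (invented sentences, not corpus data).
TOY_TRANSLATIONS = {
    "I want to go a la playa this weekend": "I want to go to the beach this weekend",
    "hello": "hello",
}

pairs = build_en_cs_pairs(
    list(TOY_TRANSLATIONS), lambda s: TOY_TRANSLATIONS.get(s, "")
)
examples = [to_training_example(p) for p in pairs]
```

The direction of the pairs is the point of the design: translation runs from CS to English, but training runs from English to CS, so the model's targets are always naturally produced CS sentences.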

Paper Structure

This paper contains 28 sections, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Error distribution by model, obtained by counting the number of instances that present errors of each type.
  • Figure 2: Heatmap of the correlations between human scores and reference-based metrics and scores given by GPT, calculated using the Pearson Correlation Coefficient. The correlations are calculated for all instances, as well as for different subsets of instances, according to the type of errors they exhibit.
  • Figure 3: Distribution of Adequacy and Fluency scores per annotator.
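
As a rough illustration of the analysis behind Figure 2, the sketch below computes the Pearson correlation coefficient between human scores and a metric, both over all instances and over the subset showing a given error type. The record fields (`human`, `bleu`, `errors`), the error label, and the numbers are hypothetical placeholders, not values from the paper.

```python
from math import sqrt
from typing import Dict, List, Optional, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / sqrt(var_x * var_y)


def correlation_for_subset(
    records: List[Dict],
    metric_key: str,
    error_type: Optional[str] = None,
) -> float:
    """Correlate human scores with one metric, optionally filtered by error type."""
    rows = [r for r in records if error_type is None or error_type in r["errors"]]
    return pearson([r["human"] for r in rows], [r[metric_key] for r in rows])


# Invented toy records; "wrong_switch" is a placeholder error label.
records = [
    {"human": 1, "bleu": 0.40, "errors": {"wrong_switch"}},
    {"human": 2, "bleu": 0.35, "errors": {"wrong_switch"}},
    {"human": 3, "bleu": 0.30, "errors": set()},
    {"human": 3, "bleu": 0.10, "errors": {"wrong_switch"}},
]

overall = correlation_for_subset(records, "bleu")
with_error = correlation_for_subset(records, "bleu", error_type="wrong_switch")
```

Repeating the call for each metric and each error type yields the grid of values that a heatmap like Figure 2 displays.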