Code-switching in text and speech reveals information-theoretic audience design
Debasmita Bhattacharya, Marten van Schijndel
TL;DR
This work investigates whether code-switching in Chinese–English text and spontaneous speech is driven only by ease of production for the speaker or also by audience-directed signaling of information load. Using bilingual corpora and language-model-based surprisal measures, the authors show that code-switching tends to occur at regions of high information load in the primary language and, crucially, that the secondary-language continuations (CS1 English) often carry even higher information load and linguistic complexity than their monolingual equivalents. Specifically, CS1 English is longer, less frequent, and more surprising than monolingual English, and these patterns persist across writing and speech, with some modality-dependent differences in effect size. By combining 5-gram models, neural LMs, and regression analyses, the paper provides evidence that information-theoretic, audience-driven pressures shape code-switching, suggesting that code-switching operates at the level of the communication channel to signal processing difficulty. The findings have implications for sociolinguistics, language modeling, and multilingual communication research, and offer a scalable methodology for analyzing audience design in code-switching across languages and modalities.
Abstract
In this work, we use language modeling to investigate the factors that influence code-switching. Code-switching occurs when a speaker alternates between one language variety (the primary language) and another (the secondary language), and is widely observed in multilingual contexts. Recent work has shown that code-switching is often correlated with areas of high information load in the primary language, but it is unclear whether high primary language load only makes the secondary language relatively easier to produce at code-switching points (speaker-driven code-switching), or whether code-switching is additionally used by speakers to signal the need for greater attention on the part of listeners (audience-driven code-switching). In this paper, we use bilingual Chinese-English online forum posts and transcripts of spontaneous Chinese-English speech to replicate prior findings that high primary language (Chinese) information load is correlated with switches to the secondary language (English). We then demonstrate that the information load of the English productions is even higher than that of meaning-equivalent Chinese alternatives, and that these productions are therefore not easier to produce, providing evidence of audience-driven influences in code-switching at the level of the communication channel, not just at the sociolinguistic level, in both writing and speech.
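To make "information load" concrete: it is commonly operationalized as surprisal, the negative log probability of a word given its preceding context under a language model. The sketch below is only an illustration of that quantity, assuming an add-one-smoothed bigram model over a toy corpus; it is not the authors' pipeline, which relies on 5-gram and neural language models. The function names `train_bigram` and `surprisal` and the example sentences are hypothetical.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Collect context and bigram counts from tokenized sentences.

    Each sentence is prefixed with a start symbol "<s>" so that the
    first word also has a context. (Toy model, assumed for illustration.)
    """
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        padded = ["<s>"] + list(tokens)
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            contexts[prev] += 1          # how often prev is followed by anything
            bigrams[(prev, cur)] += 1    # how often prev is followed by cur
    return contexts, bigrams, vocab

def surprisal(prev, cur, contexts, bigrams, vocab):
    """Surprisal in bits: -log2 P(cur | prev), with add-one smoothing."""
    p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + len(vocab))
    return -math.log2(p)

# Hypothetical toy corpus; real analyses would use large monolingual
# and code-switched corpora.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]
contexts, bigrams, vocab = train_bigram(corpus)

for word in ["cat", "dog", "ran"]:
    bits = surprisal("the", word, contexts, bigrams, vocab)
    print(f"surprisal({word!r} | 'the') = {bits:.3f} bits")
```

In this toy setting, a continuation that is frequent after "the" receives lower surprisal than a rare or unseen one. Comparing such per-word values between code-switched and monolingual text is the kind of contrast the surprisal measures are used for, although with far stronger models than this one.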
