A Survey of Code-switched Speech and Language Processing

Sunayana Sitaram, Khyathi Raghavi Chandu, Sai Krishna Rallabandi, Alan W Black

TL;DR

This survey addresses code-switching in speech and NLP, outlining why mixed-language input is prevalent and challenging for technology. It catalogs data resources across speech and text, reviews task-specific techniques (ASR, LM, LID, NER, POS tagging, parsing, QA, MT, dialogue, and others), and notes the role of linguistic theory in guiding model design. It also highlights shared tasks and benchmarks (e.g., GLUECoS, LINCE) and assesses how multilingual models perform on code-switched input, identifying gaps, especially in higher-level tasks such as sentiment analysis, QA, and NLI. The paper concludes with challenges and directions, advocating synthetic data, transfer learning, sociolinguistically aware modeling, and broader end-to-end code-switching systems as key avenues for progress.

Abstract

Code-switching, the alternation of languages within a conversation or utterance, is a common communicative phenomenon that occurs in multilingual communities across the world. This survey reviews computational approaches for code-switched Speech and Natural Language Processing. We motivate why processing code-switched text and speech is essential for building intelligent agents and systems that interact with users in multilingual communities. As code-switching data and resources are scarce, we list what is available in various code-switched language pairs with the language processing tasks they can be used for. We review code-switching research in various Speech and NLP applications, including language processing tools and end-to-end systems. We conclude with future directions and open problems in the field.

Paper Structure

This paper contains 36 sections.