ConCSE: Unified Contrastive Learning and Augmentation for Code-Switched Embeddings
Jangyeong Jeon, Sangyeon Cho, Minuk Ma, Junyoung Kim
TL;DR
This work tackles code-switching (CS) between English and Korean by introducing the Koglish dataset and a CS-aware embedding method, ConCSE. ConCSE combines CS augmentation with three losses (Cross Contrastive Loss, Cross Triplet Loss, and Align Negative Loss) to explicitly model cross-language semantics in code-switched sentences. Experimental results on Koglish-STS show that ConCSE consistently improves over SimCSE across backbones and CS tasks, with notable gains when training on CS data (CS2CS) and using CS augmentation. The approach demonstrates the value of tailored CS datasets and augmentation for robust multilingual sentence embeddings and scales to multiple languages and tasks, offering a path toward better handling of low-resource CS data in downstream NLP applications.
Abstract
This paper examines the Code-Switching (CS) phenomenon, in which two languages intertwine within a single utterance. There is a noticeable need for research on CS between English and Korean. We highlight that the current Equivalence Constraint (EC) theory for CS in other languages may only partially capture English-Korean CS complexities because of the intrinsic grammatical differences between the two languages. To mitigate such challenges, we introduce a novel Koglish dataset tailored to English-Korean CS scenarios. First, we construct the Koglish-GLUE dataset to demonstrate the importance of and need for CS datasets in various tasks. We find that various multilingual foundation language models yield different outcomes when trained on a monolingual dataset versus a CS dataset. Motivated by this, we hypothesize that SimCSE, which has shown strengths in monolingual sentence embedding, has limitations in CS scenarios. To verify this, we construct a novel Koglish-NLI (Natural Language Inference) dataset using a CS augmentation-based approach. Building on this CS-augmented Koglish-NLI dataset, we propose ConCSE, a unified contrastive learning and augmentation method for code-switched embeddings that highlights the semantics of CS sentences. Experimental results validate the proposed ConCSE, with an average performance improvement of 1.77% on the Koglish-STS (Semantic Textual Similarity) tasks.
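As a rough illustration of the kind of contrastive objective that SimCSE-style methods build on, the sketch below computes a generic in-batch InfoNCE loss between English sentence embeddings and their code-switched counterparts. It is not the Cross Contrastive, Cross Triplet, or Align Negative Loss of ConCSE, whose exact forms are not given in this section. The function names, the toy two-dimensional embeddings, and the temperature of 0.05 are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.05):
    """Mean in-batch InfoNCE loss.

    anchors[i] is expected to match positives[i]; every other
    positives[j] in the batch acts as a negative for anchors[i].
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Hypothetical toy embeddings: two English sentences and their
# code-switched versions, first correctly paired, then mispaired.
english = [[1.0, 0.0], [0.0, 1.0]]
code_switched_aligned = [[0.9, 0.1], [0.1, 0.9]]
code_switched_shuffled = [[0.1, 0.9], [0.9, 0.1]]

aligned_loss = info_nce(english, code_switched_aligned)
shuffled_loss = info_nce(english, code_switched_shuffled)
print(aligned_loss < shuffled_loss)
```

The loss is small when each English embedding lies closest to its own code-switched counterpart and large when the pairs are mismatched, which is the behavior a CS-aware embedding objective would be expected to reward.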
