ConCSE: Unified Contrastive Learning and Augmentation for Code-Switched Embeddings

Jangyeong Jeon; Sangyeon Cho; Minuk Ma; Junyoung Kim

TL;DR

This work tackles code-switching (CS) between English and Korean by introducing the Koglish dataset and a CS-aware embedding method, ConCSE. ConCSE combines CS augmentation with three losses (Cross Contrastive Loss, Cross Triplet Loss, and Align Negative Loss) to explicitly model cross-language semantics in code-switched sentences. Experimental results on Koglish-STS show that ConCSE consistently improves over SimCSE across backbones and CS tasks, with notable gains when training on CS data (CS2CS) and using CS augmentation. The approach demonstrates the value of tailored CS datasets and augmentation for robust multilingual sentence embeddings and scales to multiple languages and tasks, offering a path toward better handling of low-resource CS data in downstream NLP applications.

Abstract

This paper examines the Code-Switching (CS) phenomenon, where two languages intertwine within a single utterance. There is a noticeable need for research on CS between English and Korean. We highlight that the current Equivalence Constraint (EC) theory for CS in other languages may only partially capture English-Korean CS complexities due to the intrinsic grammatical differences between the languages. To mitigate such challenges, we introduce a novel Koglish dataset tailored for English-Korean CS scenarios. First, we construct the Koglish-GLUE dataset to demonstrate the importance of, and need for, CS datasets in various tasks. We find that various multilingual foundation language models produce different outcomes when trained on a monolingual dataset versus a CS dataset. Motivated by this, we hypothesize that SimCSE, which has shown strengths in monolingual sentence embedding, has limitations in CS scenarios. To verify this, we construct a novel Koglish-NLI (Natural Language Inference) dataset using a CS augmentation-based approach. Building on this CS-augmented Koglish-NLI dataset, we propose a unified contrastive learning and augmentation method for code-switched embeddings, ConCSE, which highlights the semantics of CS sentences. Experimental results validate the proposed ConCSE, with an average performance improvement of 1.77% on the Koglish-STS (Semantic Textual Similarity) tasks.
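As a rough illustration of what CS augmentation of an NLI triplet could look like, the sketch below swaps English noun phrases for Korean ones. It assumes noun phrases are the switched units, as the dataset-construction figure caption indicates. The `NP_GLOSSARY` table, the function names, and the example sentences are hypothetical stand-ins: the pipeline described by the paper uses a constituency parser and a translation service, neither of which is reproduced here.

```python
# Hypothetical glossary standing in for a constituency parser plus a
# translation service; the entries are illustrative only.
NP_GLOSSARY = {
    "a man": "남자",
    "a guitar": "기타",
    "a woman": "여자",
}


def code_switch(sentence, glossary=NP_GLOSSARY):
    """Replace each known English noun phrase with its Korean counterpart.

    This is naive substring replacement, not real parsing: it only works
    for phrases that appear verbatim in the glossary.
    """
    switched = sentence
    # Replace longer phrases first so a short phrase never clobbers a longer one.
    for phrase in sorted(glossary, key=len, reverse=True):
        switched = switched.replace(phrase, glossary[phrase])
    return switched


def augment_triplet(triplet):
    """Map an (anchor, positive, negative) triplet to its code-switched version."""
    return tuple(code_switch(sentence) for sentence in triplet)


# Toy NLI-style triplet: anchor, entailment (positive), contradiction (negative).
triplet = (
    "a man plays a guitar",
    "a man makes music with a guitar",
    "a woman reads quietly",
)
cs_triplet = augment_triplet(triplet)
print(cs_triplet)
```

Each monolingual triplet then has a code-switched counterpart, so a training batch can contain both versions of the same sentence.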

Paper Structure (31 sections, 9 equations, 3 figures, 9 tables)

Introduction
Related Work
Theoretical Foundations of Code-Switching
Representation Learning
Deep Metric Learning
Contrastive Learning
Proposed Dataset: Koglish
Code-switching Patterns and Dataset Construction
Constructing Koglish Dataset
Proposed Method: ConCSE
Cross Contrastive Loss
Cross Triplet Loss
Align Negative Loss
Experiments
Experiments on Koglish: The Role of Koglish in Code-Switching Scenario
...and 16 more sections

Figures (3)

Figure 1: Schematic of the parse tree, based on constituency parsing, to convert a monolingual sentence into an English-Korean code-switched sentence.
Figure 2: Systematic approach to constructing and augmenting the Koglish dataset. A constituency parser extracts nouns or noun phrases (NP), which are then translated using the Google Translate API. In this case, the GLUE and STS datasets are generated as CS datasets, and the NLI datasets are CS-augmented.
Figure 3: Overview of ConCSE. A mini-batch contains both $\mathcal{D}_{en} = \{x_i, x_{i}^{+}, x_{i}^{-}\}_{i=1}^{m}$ and CS-augmented $\mathcal{D}_{cs} = \{\hat{x}_i, \hat{x}_{i}^{+}, \hat{x}_{i}^{-}\}_{i=1}^{m}$, and their hidden representations are $H = \{h_{i},h_{i}^{+},h_{i}^{-}\}_{i=1}^{N}$ and $\hat{H} = \{\hat{h}_{i},\hat{h}_{i}^{+},\hat{h}_{i}^{-}\}_{i=1}^{N}$. They are processed by the sentence encoder $\mathcal{M}_{\phi}$, producing "[CLS]" as the final sentence representation. The "[CLS]" representations of the multi-positive group, comprising monolingual sentences ($h_i, h_{i}^{+}$) and CS sentences ($\hat{h}_i, \hat{h}_{i}^{+}$), should be attracted to each other. Similarly, the "[CLS]" representations of the multi-negative pair, comprising a monolingual sentence ($h_{i}^{-}$) and a CS sentence ($\hat{h}_{i}^{-}$), should also be attracted to each other. Moreover, multi-positive groups and multi-negative pairs should be pushed apart from each other.
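The attract-and-repel behavior described in the Figure 3 caption can be illustrated with a generic InfoNCE-style objective. The sketch below is not the paper's Cross Contrastive, Cross Triplet, or Align Negative Loss, whose exact forms are not given in this excerpt. It assumes cosine similarity, a hypothetical temperature `tau`, and toy 2-D vectors standing in for "[CLS]" representations.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positives, negatives, tau=0.05):
    """Generic multi-positive InfoNCE loss (illustrative, not the paper's loss).

    For each positive p, computes
        -log( e^{s(a,p)/tau} / (e^{s(a,p)/tau} + sum_n e^{s(a,n)/tau}) )
    and returns the mean over all positives.
    """
    neg_sum = sum(math.exp(cosine(anchor, n) / tau) for n in negatives)
    losses = []
    for p in positives:
        pos_term = math.exp(cosine(anchor, p) / tau)
        losses.append(-math.log(pos_term / (pos_term + neg_sum)))
    return sum(losses) / len(losses)


# Toy stand-ins for "[CLS]" vectors (assumed values, chosen by hand).
h = [1.0, 0.0]            # monolingual anchor, h_i
h_pos = [0.9, 0.1]        # monolingual positive, h_i^+
h_hat = [0.95, 0.05]      # code-switched anchor, \hat{h}_i
h_hat_pos = [0.85, 0.2]   # code-switched positive, \hat{h}_i^+
h_neg = [0.0, 1.0]        # monolingual negative, h_i^-
h_hat_neg = [0.1, 0.9]    # code-switched negative, \hat{h}_i^-

# Well-arranged embeddings: the multi-positive group sits near the anchor.
loss_good = info_nce(h, [h_pos, h_hat, h_hat_pos], [h_neg, h_hat_neg])
# Same vectors with the roles swapped: the "positives" are far from the anchor.
loss_bad = info_nce(h, [h_neg, h_hat_neg], [h_pos, h_hat, h_hat_pos])
print(loss_good < loss_bad)
```

Minimizing a loss of this shape pulls the anchor toward every member of the multi-positive group and pushes it away from the negatives, which is the geometry the caption describes.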
