SASFT: Sparse Autoencoder-guided Supervised Finetuning to Mitigate Unexpected Code-Switching in LLMs

Boyi Deng, Yu Wan, Baosong Yang, Fei Huang, Wenjie Wang, Fuli Feng

TL;DR

This paper provides an in-depth analysis of unexpected code-switching using sparse autoencoders and proposes SASFT, which teaches LLMs to maintain appropriate pre-activation values of specific language features during training, showing its effectiveness in addressing code-switching while preserving multilingual capabilities.

Abstract

Large Language Models (LLMs) have impressive multilingual capabilities, but they suffer from unexpected code-switching, also known as language mixing, which involves switching to unexpected languages in the model response. This problem leads to poor readability and degrades the usability of model responses. However, existing work on this issue lacks a mechanistic analysis and shows limited effectiveness. In this paper, we first provide an in-depth analysis of unexpected code-switching using sparse autoencoders and find that when LLMs switch to a language, the features of that language exhibit excessive pre-activation values. Based on our findings, we propose Sparse Autoencoder-guided Supervised Finetuning (SASFT), which teaches LLMs to maintain appropriate pre-activation values of specific language features during training. Experiments on five models across three languages demonstrate that SASFT consistently reduces unexpected code-switching by more than 50% compared to standard supervised fine-tuning, with complete elimination in one case. Moreover, SASFT maintains or even improves the models' performance on six multilingual benchmarks, showing its effectiveness in addressing code-switching while preserving multilingual capabilities. The code and data are available at https://github.com/Aatrox103/SASFT.
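
As a rough illustration of what "maintaining appropriate pre-activation values" of a language feature could look like as a training signal, the sketch below adds a penalty to a fine-tuning loss whenever a feature's pre-activation exceeds a threshold. It is not the authors' implementation (the linked repository is the reference for that). The function names, the hinge form of the penalty, the `threshold` and `weight` parameters, and the encoder form `w_enc · (x - b_dec) + b_enc` (a common sparse autoencoder convention, with the pre-activation taken before the nonlinearity) are all assumptions made for illustration.

```python
from typing import Sequence


def feature_pre_activation(
    hidden: Sequence[float],
    w_enc: Sequence[float],
    b_enc: float,
    b_dec: Sequence[float],
) -> float:
    """Pre-activation of one SAE feature for one token's hidden state.

    Assumes the common encoder form w_enc . (x - b_dec) + b_enc,
    evaluated before the activation function is applied.
    """
    return sum((h - c) * w for h, c, w in zip(hidden, b_dec, w_enc)) + b_enc


def pre_activation_penalty(pre_acts: Sequence[float], threshold: float = 0.0) -> float:
    """Mean hinge penalty: only pre-activations above `threshold` contribute.

    The hinge form and the threshold are hypothetical choices.
    """
    excess = [max(p - threshold, 0.0) for p in pre_acts]
    return sum(excess) / len(excess)


def combined_loss(
    ce_loss: float,
    pre_acts: Sequence[float],
    weight: float = 1.0,
    threshold: float = 0.0,
) -> float:
    """Hypothetical objective: cross-entropy loss plus a weighted feature penalty."""
    return ce_loss + weight * pre_activation_penalty(pre_acts, threshold)


if __name__ == "__main__":
    # Toy values: two tokens with 2-dimensional hidden states.
    tokens = [[1.0, 0.0], [0.0, 2.0]]
    w_enc, b_enc, b_dec = [1.0, -1.0], 0.5, [0.0, 0.0]
    pre_acts = [feature_pre_activation(t, w_enc, b_enc, b_dec) for t in tokens]
    print(pre_acts)
    print(combined_loss(2.0, pre_acts, weight=0.1))
```

In a real training loop the penalty would be computed on differentiable tensors taken from a chosen layer, so that gradients flow back into the model; the plain-Python version here only shows the arithmetic.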

Paper Structure

This paper contains 35 sections, 10 equations, 22 figures, 14 tables.

Figures (22)

  • Figure 1: Examples of unexpected code-switching to Chinese, Russian, and Korean.
  • Figure 2: The unexpected code-switching to Chinese for five LLMs in six languages. The results suggest that unexpected code-switching is a common issue in multilingual LLMs.
  • Figure 3: The average pre-activation values of the Chinese feature at different token positions across various LLMs. Position 0 represents the first token that switches to Chinese. Before code-switching occurs, the pre-activation values of the Chinese feature gradually increase.
  • Figure 4: The code-switching ratio to Chinese after ablating Chinese or English features with different $\lambda$. (1) Ablating the Chinese feature can reduce the unexpected code-switching ratio. (2) A higher coefficient $\lambda$ leads to better reduction in the unexpected code-switching ratio. (3) Ablating the English feature has little impact on the unexpected code-switching ratio to Chinese.
  • Figure 5: SASFT operates in two steps: First, it identifies language-specific features in LLMs (left), then leverages these features as training signals to reduce code-switching behavior (right).
  • ...and 17 more figures