DeFT-X: Denoised Sparse Fine-Tuning for Zero-Shot Cross-Lingual Transfer

Sona Elza Simon, Preethi Jyothi

TL;DR

DeFT-X tackles zero-shot cross-lingual transfer by adding a denoising step to sparse fine-tuning. It uses a low-rank SVD to separate informative (lower-order, high-singular-value) components from noisy (higher-order, low-singular-value) components before magnitude pruning, producing higher-quality language- and task-specific subnetworks. The method performs on par with or better than strong baselines (LT-SFT and MAD-X) on extremely low-resource languages in NusaX and AmericasNLI, with ablations confirming the benefits of denoising, sparsity, and re-training. This denoised sparse fine-tuning approach reduces interference between language and task vectors and offers a practical, scalable path for cross-lingual transfer in low-resource settings.

Abstract

Effective cross-lingual transfer remains a critical challenge in scaling the benefits of large language models from high-resource to low-resource languages. Towards this goal, prior studies have explored many approaches to combine task knowledge from task-specific data in a (high-resource) source language and language knowledge from unlabeled text in a (low-resource) target language. One notable approach proposed composable sparse fine-tuning (SFT) for cross-lingual transfer, which learns task-specific and language-specific sparse masks to select a subset of the pretrained model's parameters that are further fine-tuned. These sparse fine-tuned vectors (SFTs) are subsequently composed with the pretrained model to facilitate zero-shot cross-lingual transfer to a task in a target language, using only task-specific data from a source language. The sparse masks for SFTs were identified using simple magnitude-based pruning. In our work, we introduce DeFT-X, a novel composable SFT approach that denoises the weight matrices of a pretrained model before magnitude pruning using singular value decomposition, thus yielding more robust SFTs. We evaluate DeFT-X on a diverse set of extremely low-resource languages for sentiment classification (NusaX) and natural language inference (AmericasNLI) and demonstrate that it performs on par with or outperforms SFT and other prominent cross-lingual transfer baselines.
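
The denoise-then-prune step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `denoise_low_rank` and `top_k_mask`, the toy matrix sizes, and the choice to simply discard all components beyond the retained rank are assumptions made for illustration, and the paper's exact treatment of the higher-order components may differ.

```python
import numpy as np

def denoise_low_rank(delta_w, rank):
    """Keep only the `rank` largest singular-value components of delta_w.

    Assumption: "denoising" is modeled here as plain SVD truncation.
    """
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    r = min(rank, s.shape[0])
    # Scale the first r left singular vectors by their singular values,
    # then project back with the first r right singular vectors.
    return (u[:, :r] * s[:r]) @ vt[:r, :]

def top_k_mask(matrix, k):
    """Boolean mask selecting the k entries with the largest magnitude."""
    flat = np.abs(matrix).ravel()
    k = min(k, flat.size)
    mask = np.zeros(flat.size, dtype=bool)
    if k > 0:
        idx = np.argpartition(flat, flat.size - k)[flat.size - k:]
        mask[idx] = True
    return mask.reshape(matrix.shape)

# Toy data (hypothetical): a rank-2 "informative" update plus small noise.
rng = np.random.default_rng(0)
theta = rng.normal(size=(8, 6))                       # pretrained weights
signal = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 6))
noise = 0.01 * rng.normal(size=(8, 6))
theta_fft = theta + signal + noise                    # stand-in for full fine-tuning

delta_w = theta_fft - theta                           # difference to the pretrained model
denoised = denoise_low_rank(delta_w, rank=2)          # drop higher-order components
mask = top_k_mask(denoised, k=10)                     # magnitude pruning on the denoised update
```

In a composable SFT setup, a mask like `mask` would then restrict which parameters are updated during the subsequent sparse fine-tuning pass; that training loop is omitted here.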

Paper Structure

This paper contains 33 sections, 5 equations, 3 figures, 9 tables.

Figures (3)

  • Figure 1: A graphical representation of DeFT-X. The pretrained model $\theta$ (gray, left) undergoes full fine-tuning to obtain $\theta_{\text{FFT}}$. The difference $\Delta W$ (blue and red, left) captures the magnitude difference between $\theta$ and $\theta_{\text{FFT}}$. Each weight matrix in $\Delta W$ is denoised by pruning higher-order components (i.e., lower singular value components) while retaining lower-order components (i.e., high singular value components). The denoised $\Delta W$ is then magnitude-pruned and sparsely fine-tuned to produce $\phi$. Finally, the language-specific component $\phi_L$ and task-specific component $\phi_T$ are combined via addition to form the target language-task model $\theta_{\text{TL}}$ (left).
  • Figure 2: Overlap between the sparse language vectors and their corresponding task vectors. For DeFT-X, we compare using $r_l = r_t = 100$.
  • Figure 3: Overlap (in percentage) between the sparse language vectors of DeFT-X at $r_l = 100$.
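
The composition by addition described in Figure 1 and the overlap comparisons in Figures 2 and 3 can be sketched as follows. This is an illustrative sketch only: the helper names `compose` and `overlap_percent` are hypothetical, the vectors are toy values, and the overlap definition used here (the share of nonzero positions in the first vector that are also nonzero in the second) is an assumption, since the figure captions do not state the exact formula.

```python
import numpy as np

def compose(theta, phi_lang, phi_task):
    """Add the sparse language and task vectors to the pretrained weights."""
    return theta + phi_lang + phi_task

def overlap_percent(phi_a, phi_b):
    """Percentage of nonzero positions in phi_a that are also nonzero in phi_b.

    Assumption: overlap is measured on the sparsity pattern, not on values.
    """
    nz_a = phi_a != 0
    nz_b = phi_b != 0
    total = int(nz_a.sum())
    if total == 0:
        return 0.0
    return 100.0 * int((nz_a & nz_b).sum()) / total

# Toy sparse vectors (hypothetical values).
theta = np.ones(4)
phi_lang = np.array([0.5, 0.0, -0.2, 0.0])   # language-specific sparse update
phi_task = np.array([0.1, 0.3, 0.0, 0.0])    # task-specific sparse update

theta_tl = compose(theta, phi_lang, phi_task)
shared = overlap_percent(phi_lang, phi_task)
```

A low overlap under this definition would mean the language and task vectors mostly modify different parameters, which is one way to read the interference comparison in Figure 2.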