KS-Lottery: Finding Certified Lottery Tickets for Multilingual Language Models

Fei Yuan; Chang Ma; Shuai Yuan; Qiushi Sun; Lei Li

TL;DR

The paper asks whether ultra-small subnetworks of a multilingual LLM can match the performance of full fine-tuning. It introduces KS-Lottery, which uses the Kolmogorov-Smirnov test to detect distribution shifts in embedding parameters during fine-tuning and identifies certifiable winning tickets within the embedding layer. The method shows that tuning as few as 18 token embeddings can deliver translation quality comparable to full fine-tuning on multilingual benchmarks, and it provides a theoretical certification that guarantees performance under defined distribution-distance conditions. Empirically, on bilingual translation tasks with LLaMA-7B, KS-Lottery outperforms several parameter-efficient tuning approaches in parameter efficiency and interpretability, and it generalizes well to the Partial Tuning and Partial Transfer settings. The work offers a principled, certifiable route to efficient multilingual transfer that may apply beyond translation.

Abstract

The lottery ticket hypothesis posits the existence of "winning tickets" within a randomly initialized neural network. Do winning tickets exist for LLMs in fine-tuning scenarios? How can we find such winning tickets? In this paper, we propose KS-Lottery, a method to identify a small subset of LLM parameters that is highly effective in multilingual fine-tuning. Our key idea is to use the Kolmogorov-Smirnov test to analyze the distribution shift of parameters before and after fine-tuning. We further prove theoretically that KS-Lottery can find certified winning tickets in the embedding layer: fine-tuning only the found parameters is guaranteed to perform as well as full fine-tuning. Comparing KS-Lottery with other parameter-efficient tuning algorithms on translation tasks, the experimental results show that KS-Lottery finds a much smaller set of parameters for fine-tuning while achieving performance comparable to full fine-tuning of the LLM. Surprisingly, we find that fine-tuning the embeddings of only 18 tokens of LLaMA suffices to reach the fine-tuning translation performance. Code: https://github.com/CONE-MT/KS-Lottery
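The selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each token's embedding vector before and after fine-tuning is treated as two samples, that the threshold is the textbook large-sample critical value of the two-sample test, and that the helper names (`ks_statistic`, `ks_threshold`, `select_tokens`) and the toy data are hypothetical.

```python
import math
import random
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    # The gap between two step functions can only change at a sample point.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_threshold(n, m, alpha):
    """Textbook large-sample critical value for significance level alpha (assumption)."""
    return math.sqrt(-0.5 * math.log(alpha / 2.0)) * math.sqrt((n + m) / (n * m))


def select_tokens(emb_before, emb_after, alpha=0.05):
    """Return indices of tokens whose embedding distribution shifted significantly."""
    selected = []
    for i, (row_before, row_after) in enumerate(zip(emb_before, emb_after)):
        tau = ks_threshold(len(row_before), len(row_after), alpha)
        if ks_statistic(row_before, row_after) > tau:
            selected.append(i)
    return selected


# Toy data (hypothetical): 6 tokens with 256-dimensional embeddings.
rng = random.Random(0)
dim = 256
emb_before = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(6)]
# "Fine-tuning" barely moves most tokens but shifts tokens 1 and 4 strongly.
emb_after = [[v + rng.gauss(0.0, 0.01) for v in row] for row in emb_before]
for i in (1, 4):
    emb_after[i] = [v + 1.0 for v in emb_before[i]]

winning_tickets = select_tokens(emb_before, emb_after, alpha=0.05)
print(winning_tickets)
```

On this toy data only the two strongly shifted tokens should be flagged. Lowering `alpha` raises the threshold and therefore selects fewer tokens.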

Paper Structure

This paper contains 20 sections, 2 theorems, 6 equations, 6 figures, 11 tables, 1 algorithm.

Introduction
Related Work
Certified Winning Tickets via KS-Lottery
Embed Tuning is effective for multilingual transfer.
Kolmogorov-Smirnov Test
Finding 1: KS-Lottery finds certifiable winning tickets within the embedding layer.
Finding 2: 18 identified winning tickets (18 tokens) achieve remarkable performance.
Analysis
KS-Lottery Certification
KS-Lottery Efficiency and Interpretability
Sensitivity to the significance level selection ($\alpha$).
Conclusion
Limitations and Broader Impacts
Limitations.
Broader Impacts.
...and 5 more sections

Key Result

Theorem 1

(Kolmogorov-Smirnov Test, ks-test) The test statistic for this Kolmogorov-Smirnov Test can be defined in terms of two hypotheses: $H_0$: $\theta_i$ and $\widetilde{\theta_i}$ come from the same distribution. $H_1$: the two samples are not from the same distribution. If test T: $D_i >\tau(\alpha)$ i
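
For reference, the standard two-sample Kolmogorov-Smirnov statistic and its large-sample critical value are given below. This is the textbook form, stated under the assumption that $\theta_i$ and $\widetilde{\theta_i}$ are treated as samples of sizes $n$ and $m$ with empirical CDFs $F_{\theta_i}$ and $F_{\widetilde{\theta_i}}$; the excerpt above does not show the paper's exact definition of $\tau(\alpha)$, so it may differ in detail.

$$
D_i = \sup_x \left| F_{\theta_i}(x) - F_{\widetilde{\theta_i}}(x) \right|,
\qquad
\tau(\alpha) = c(\alpha)\sqrt{\frac{n+m}{nm}},
\qquad
c(\alpha) = \sqrt{-\tfrac{1}{2}\ln\frac{\alpha}{2}}
$$

In this form $H_0$ is rejected at level $\alpha$ when $D_i > \tau(\alpha)$. For example, $c(0.05) \approx 1.358$, and when $n = m = d$ the threshold reduces to $\tau(\alpha) = c(\alpha)\sqrt{2/d}$.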

Figures (6)

Figure 1: (a) KS-Lottery identifies a small subset of the embedding parameters of LLaMA-7B that maintains en$\rightarrow$ca translation performance on Flores-101. (b) KS-Lottery consists of two steps: (1) finding the winning tickets in the embedding layer with the Kolmogorov-Smirnov test; (2) using these winning tickets, for example by tuning only these tokens while all other parameters stay frozen.
Figure 2: Illustration of $f(\cdot, \boldsymbol{\theta}_a^E, \boldsymbol{\theta}_b^E)$ in 2 dimensions. Left: the concentric circles are the density contours of the embedding parameters before and after fine-tuning, and the colored landscape shows the decision boundaries of $f(\cdot)$. Right: the distributions $\mathbb{P}\left[f(x,\theta_a^E,\widetilde{\theta_b^E})\right]$ and $\mathbb{P}\left[f(x,\widetilde{\theta_a^E},\theta_b^E)\right]$. $\underline{p_A}$ is the probability that $\mathbb{P}\left[f(x,\widetilde{\theta}_a^E,\widetilde{\theta_b^E})\right]$ predicts $x$ to be token $c_A$ (blue), and $\overline{p_B}$ is the probability of the second most likely prediction (red). $D_{ks}$ denotes the Kolmogorov-Smirnov distance between the distributions before and after tuning. The token embeddings selected for fine-tuning are those whose distributions before and after fine-tuning overlap little, since these may be critical to prediction.
Figure 3: Certified experiment under the Partial Tuning setting. A: Estimation of $\tau(\alpha)$ for different values of $\alpha$, obtained by running the Kolmogorov-Smirnov test between the distribution of the LLaMA-7B embedding and the fine-tuned embedding on different datasets. B, C, D: Comparison between Certified Accuracy and Empirical Prediction Accuracy for different values of $\alpha$ on three datasets. More results are shown in the appendix.
Figure 4: (a) When the selected tokens are prevented from being updated, fine-tuning on downstream tasks loses its effectiveness. (b) KS-Lottery applied to the whole LLaMA-7B, with single layers fine-tuned on 10k en$\rightarrow$ca examples from Lego-MT (Yuan et al., 2023). Each layer is trained in isolation and analyzed by KS-Lottery to identify the parameters with significant changes (the scatter points above the red line). Within each Transformer layer, changes are concentrated in LayerNorm, while other notable changes occur in the embedding layer.
Figure 5: Certified experiment under the Partial Transfer setting. A: Estimation of $\tau(\alpha)$ for different values of $\alpha$, obtained by running the Kolmogorov-Smirnov test between the distribution of the LLaMA-7B embedding and the fine-tuned embedding on different datasets. B, C, D: Comparison between Certified Accuracy and Empirical Prediction Accuracy for different values of $\alpha$ on three datasets.
...and 1 more figures
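
Step (2) in the Figure 1 caption, partial tuning, amounts to updating only the selected token rows of the embedding matrix while everything else stays frozen. The sketch below illustrates that idea with a single plain SGD step on nested lists. The function name `partial_tuning_step`, the learning rate, and the toy gradients are hypothetical, and a real setup would apply the same row mask inside a deep-learning framework rather than by hand.

```python
def partial_tuning_step(embedding, grads, trainable_tokens, lr=0.1):
    """One SGD step that updates only the rows listed in trainable_tokens."""
    trainable = set(trainable_tokens)
    updated = []
    for i, (row, grad_row) in enumerate(zip(embedding, grads)):
        if i in trainable:
            # Selected token: apply the gradient update.
            updated.append([w - lr * g for w, g in zip(row, grad_row)])
        else:
            # Frozen token: copy the row unchanged.
            updated.append(list(row))
    return updated


# Toy example (hypothetical values): 3 tokens, 2-dimensional embeddings.
embedding = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
grads = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]

# Only token 1 was selected as a winning ticket, so only its row moves.
updated = partial_tuning_step(embedding, grads, trainable_tokens=[1], lr=0.5)
print(updated)  # [[0.0, 0.0], [0.5, 0.5], [2.0, 2.0]]
```

Masking by row keeps the number of trainable parameters proportional to the number of selected tokens, which is where the parameter savings described above come from.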

Theorems & Definitions (5)

Theorem 1
Theorem 2
Remark 3
Remark 4
Remark 5
