Inducing Generalization across Languages and Tasks using Featurized Low-Rank Mixtures

Chu-Cheng Lin, Xinyi Wang, Jonathan H. Clark, Han Lu, Yun Zhu, Chenxi Whitehouse, Hongkun Yu

TL;DR

FLix introduces Featurized Low-Rank Mixtures to address negative interference in parameter-efficient fine-tuning when adapting LLMs to many tasks and languages. By associating a distinct low-rank adapter with each dataset feature (task or language) and composing them per input, FLix enables targeted, sparse updates that generalize better to unseen task-language pairs while keeping compute low. Empirical results on XTREME-UP benchmarks show FLix outperforms standard LoRA baselines across multitask multilingual tuning and zero-shot scenarios, with strong gains in cross-lingual QA and semantic parsing. The approach leverages feature dropout and rank-aware adapters to balance transfer between datasets and scale with increasing data diversity, offering a practical path for efficient, large-scale multilingual adaptation.

Abstract

Adapting pretrained large language models (LLMs) to various downstream tasks in tens or hundreds of human languages is computationally expensive. Parameter-efficient fine-tuning (PEFT) significantly reduces the adaptation cost by tuning only a small number of parameters. However, common PEFT methods such as LoRA (Hu et al., 2022) perform suboptimally on diverse dataset mixtures, due to aggressive parameter tying and negative interference among different datasets. In this work, we propose Featurized Low-rank Mixtures (FLix), a novel PEFT method designed for effective multitask multilingual adaptation. FLix associates each unique dataset feature, such as the dataset's language or task, with its own low-rank weight update parameters. By composing feature-specific parameters for each dataset, FLix can accommodate diverse dataset mixtures and generalize better to unseen datasets. Our experiments show that FLix leads to significant improvements over a variety of tasks in both supervised learning and zero-shot settings, with gains of up to $14.2$ inexact match points in zero-shot semantic parsing.
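The composition of feature-specific parameters described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes that the update for a dataset is the sum of the low-rank products of its active features, that feature dropout skips an active feature at random during training, and that an unseen feature contributes nothing. The function names and the feature labels (`task:qa`, `lang:sw`) are hypothetical.

```python
import numpy as np


def init_adapters(features, d_out, d_in, rank, rng):
    """Create one low-rank pair (B, A) per dataset feature.

    B starts at zero, as in LoRA, so every update B @ A is zero
    before training. (Initialization scale is an assumption.)
    """
    return {
        f: (np.zeros((d_out, rank)), rng.normal(scale=0.01, size=(rank, d_in)))
        for f in features
    }


def compose_update(adapters, active, dropout=0.0, rng=None):
    """Sum the low-rank updates of the features active for one dataset.

    Assumption: composition is a plain sum. When `rng` is given, each
    active feature is skipped with probability `dropout`.
    """
    first_B, first_A = next(iter(adapters.values()))
    delta = np.zeros((first_B.shape[0], first_A.shape[1]))
    for f in active:
        if f not in adapters:
            continue  # unseen feature: no parameters, no contribution
        if rng is not None and rng.random() < dropout:
            continue  # feature dropout (training only)
        B, A = adapters[f]
        delta += B @ A
    return delta


def forward(W, adapters, active, x, dropout=0.0, rng=None):
    """Apply the frozen weight W plus the composed update to input x."""
    return (W + compose_update(adapters, active, dropout, rng)) @ x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    adapters = init_adapters(["task:qa", "lang:sw", "lang:fi"], 8, 8, 2, rng)
    W = rng.normal(size=(8, 8))  # stands in for a frozen pretrained weight
    x = rng.normal(size=8)
    # A task-language pair is handled by combining its two adapters.
    y = forward(W, adapters, ["task:qa", "lang:fi"], x)
    print(y.shape)
```

Under these assumptions, a task-language pair that never appeared together in training still receives an update, because each of its features has its own adapter.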

Paper Structure

This paper contains 66 sections, 2 equations, 2 figures, and 49 tables.

Figures (2)

  • Figure 1: Training and inference with FLix.
  • Figure 2: Difference in performance between multitask multilingual tuning and multilingual-only tuning. While the more diverse multitask multilingual data mixture leads to a large performance drop for vanilla LoRA, FLix generally maintains or slightly improves task performance with more diverse data mixtures.