X-PEFT: eXtremely Parameter-Efficient Fine-Tuning for Extreme Multi-Profile Scenarios
Namju Kwak, Taesup Kim
TL;DR
X-PEFT targets extreme multi-profile NLP settings by reducing per-profile parameters and memory: instead of training a new adapter for each profile, it learns mask tensors that selectively compose a large pool of given adapters. It introduces soft-masked and hard-masked variants that fuse adapters without training new ones, and demonstrates strong performance on LaMP, GLUE, and SuperGLUE using both trained and random adapters, with memory reductions of up to $10^4\times$ and parameter reductions of around $10^2\times$. Framing adapter selection as an adapter-level supermask problem aligns the approach with the Lottery Ticket Hypothesis: even random adapters can yield competitive results when masked appropriately. The framework thus enables scalable multi-profile NLP deployments with minimal per-profile storage while maintaining high task performance.
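The idea of composing a frozen adapter pool through soft or hard masks can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the sizes `N`, `d`, `r`, `k`, the softmax form of the soft mask, the top-k form of the hard mask with `1/k` averaging, and all function names are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: pool size, hidden size, bottleneck size, adapters kept by the hard mask.
N, d, r, k = 8, 16, 4, 3

# Frozen pool of N bottleneck adapters (random here; they could equally be trained).
A = rng.standard_normal((N, d, r)) / np.sqrt(d)  # down-projections
B = rng.standard_normal((N, r, d)) / np.sqrt(r)  # up-projections

# The only per-profile parameters in this sketch: one score per adapter.
scores = rng.standard_normal(N)

def soft_mask(scores):
    """Assumed soft variant: softmax weights over the whole pool."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def hard_mask(scores, k):
    """Assumed hard variant: k-hot selection of the top-k scores, averaged."""
    mask = np.zeros_like(scores)
    mask[np.argsort(scores)[-k:]] = 1.0
    return mask / k

def compose(mask, A, B):
    """Weighted sum of the pooled projection matrices."""
    A_mix = np.tensordot(mask, A, axes=1)  # shape (d, r)
    B_mix = np.tensordot(mask, B, axes=1)  # shape (r, d)
    return A_mix, B_mix

def adapter_forward(x, A_mix, B_mix):
    """Residual bottleneck adapter with a ReLU nonlinearity."""
    return x + np.maximum(x @ A_mix, 0.0) @ B_mix

x = rng.standard_normal((2, d))
y_soft = adapter_forward(x, *compose(soft_mask(scores), A, B))
y_hard = adapter_forward(x, *compose(hard_mask(scores, k), A, B))
```

In this sketch the adapter pool is shared and frozen, so a new profile only needs its own `scores` vector (or the binary selection derived from it), not a new adapter.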
Abstract
Parameter-efficient fine-tuning (PEFT) techniques, such as adapter tuning, aim to fine-tune a pre-trained language model (PLM) using a minimal number of parameters for a specific task or profile. Although adapter tuning is more parameter-efficient than full-model fine-tuning, it still attaches a small set of additional parameters to the PLM for each profile. This becomes problematic in practical applications with many profiles, because the total number of additional parameters grows linearly with the number of profiles. To mitigate this issue, we introduce X-PEFT, a novel PEFT method that leverages a multitude of given adapters by fine-tuning only an extremely small set of compact tensors for each new profile; these tensors serve as binary masks that adaptively select among the given adapters. To efficiently validate our proposed method, we implement it using a large number of trained or untrained (random) adapters. We evaluate X-PEFT on LaMP and GLUE tasks and demonstrate that it matches or surpasses the effectiveness of conventional adapter tuning while reducing the per-profile memory requirement by a factor of 10,000.
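To give a feel for why storing binary masks instead of a full adapter shrinks per-profile memory so sharply, the following back-of-the-envelope sketch compares the two storage costs. Every number here is a hypothetical choice, not the paper's configuration: 12 layers, a pool of 100 adapters, hidden size 768, bottleneck size 64, fp32 weights, biases ignored, and an assumed layout of two `N`-bit masks per layer.

```python
import numpy as np

# Hypothetical configuration (not taken from the paper).
L, N = 12, 100      # transformer layers, adapters in the shared pool
d, r = 768, 64      # hidden size, adapter bottleneck size
BYTES_FP32 = 4

# Conventional adapter tuning: each profile stores a down- and an
# up-projection per layer (biases ignored for simplicity).
adapter_bytes = L * 2 * d * r * BYTES_FP32

# Mask-based storage: assumed two binary masks of N bits per layer,
# packed 8 bits to a byte.
rng = np.random.default_rng(0)
masks = rng.integers(0, 2, size=(L, 2, N), dtype=np.uint8)
packed = np.packbits(masks, axis=-1)          # shape (L, 2, ceil(N / 8))
mask_bytes = packed.nbytes

# Packing is lossless: the original masks can be recovered exactly.
restored = np.unpackbits(packed, axis=-1, count=N)

ratio = adapter_bytes / mask_bytes
print(f"adapter: {adapter_bytes} B, masks: {mask_bytes} B, ratio: {ratio:.0f}x")
```

Under these illustrative sizes the adapter costs 4,718,592 bytes per profile and the packed masks 312 bytes, a ratio of roughly $1.5 \times 10^4$, which is the same order of magnitude as the reduction stated above. The exact figure depends on the real adapter dimensions, pool size, and mask layout.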
