LoRA Meets Dropout under a Unified Framework

Sheng Wang; Liheng Chen; Jiyue Jiang; Boyang Xue; Lingpeng Kong; Chuan Wu

TL;DR

This work addresses overfitting in LoRA-based parameter-efficient finetuning by systematically comparing transformer-specific dropout methods. It shows that DropKey and DropAttention are equivalent in the forward pass, analyzes how they differ in back-propagation, and proposes a unified framework organized along three dimensions: dropping position, structural pattern, and compensation measure. From this framework, the authors derive HiddenKey, a dropout method that combines column-wise DropKey and element-wise HiddenCut with a bidirectional KL loss. HiddenKey achieves superior performance across NLU, NLG, and LLM settings while being largely sufficient on its own. The findings offer practical guidance for dropout design when finetuning transformers with LoRA.
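The forward-pass equivalence mentioned above can be checked numerically on a single row of attention logits. The following is a minimal sketch, not the authors' code. It assumes that DropKey masks logits with negative infinity before the softmax, that DropAttention zeroes weights after the softmax and renormalizes the kept weights to sum to 1, and that both use the same mask. The function names and the mask are hypothetical.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax; -inf logits get weight exactly 0."""
    m = max(x for x in logits if x != -math.inf)
    exps = [math.exp(x - m) if x != -math.inf else 0.0 for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dropkey_row(logits, keep):
    """DropKey-style: mask dropped logits with -inf before the softmax."""
    masked = [x if k else -math.inf for x, k in zip(logits, keep)]
    return softmax(masked)

def dropattention_row(logits, keep):
    """DropAttention-style: zero weights after the softmax, then renormalize
    so the kept weights sum to 1 (assumed normalized rescaling)."""
    weights = softmax(logits)
    kept = [w if k else 0.0 for w, k in zip(weights, keep)]
    total = sum(kept)
    return [w / total for w in kept]

rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(6)]  # one query's logits over 6 keys
keep = [True, False, True, True, False, True]      # hypothetical shared mask

a = dropkey_row(logits, keep)
b = dropattention_row(logits, keep)
max_diff = max(abs(x - y) for x, y in zip(a, b))   # floating-point noise only
```

Under these assumptions the two rows agree up to rounding, because dividing the kept softmax weights by their sum cancels the contribution of the dropped keys to the softmax denominator. The sketch says nothing about gradients, which is where the two methods are reported to differ.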

Abstract

With their remarkable capabilities, large language models (LLMs) have emerged as essential elements in numerous NLP applications, while parameter-efficient finetuning, especially LoRA, has gained popularity as a lightweight approach for model customization. Meanwhile, various dropout methods, initially designed for full finetuning with all parameters updated, alleviate the overfitting associated with excessive parameter redundancy. Hence, a possible contradiction arises between the negligible number of trainable parameters in LoRA and the effectiveness of previous dropout methods, which has been largely overlooked. To fill this gap, we first confirm that parameter-efficient LoRA is also overfitting-prone. We then revisit transformer-specific dropout methods and establish their equivalence and distinctions mathematically and empirically. Building upon this comparative analysis, we introduce a unified framework for a comprehensive investigation, which instantiates these methods based on dropping position, structural pattern, and compensation measure. Through this framework, we reveal their new preferences and relative performance when only a limited number of parameters are trainable. This framework also allows us to amalgamate the most favorable aspects into a novel dropout method named HiddenKey. Extensive experiments verify the remarkable superiority and sufficiency of HiddenKey across multiple models and tasks, which highlights it as the preferred approach for high-performance and parameter-efficient finetuning of LLMs.
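As one illustration of a compensation measure, a bidirectional KL term between two output distributions can be sketched as follows. This is a hedged sketch rather than the paper's implementation. It assumes the two distributions come from two stochastic forward passes over the same input, and the 1/2 weighting, the `eps` smoothing, and the function names are choices made here for illustration.

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def bidirectional_kl(p, q):
    """Symmetric term: average of KL(p||q) and KL(q||p).
    The 1/2 weighting is an assumption, not taken from the source."""
    return 0.5 * (kl(p, q) + kl(q, p))

# Hypothetical class probabilities from two dropout-perturbed passes.
p1 = [0.7, 0.2, 0.1]
p2 = [0.6, 0.3, 0.1]
loss = bidirectional_kl(p1, p2)
```

The term is zero when the two passes agree and grows as their predictions diverge, so adding it to the task loss penalizes sensitivity to the sampled dropout mask.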

Paper Structure

This paper contains 37 sections, 11 equations, 6 figures, and 8 tables.

Introduction
Preliminaries
DropAttention.
DropKey.
HiddenCut.
Method
Mathematical and Empirical Comparison
Equivalent Forwarding between DropKey and DropAttention.
Variation in Back-Propagation between DropKey and DropAttention.
Comparison with HiddenCut.
A Unified Framework
Dropping Position.
Structural Pattern.
Compensation for Training and Inference Gap.
HiddenKey
...and 22 more sections

Figures (6)

Figure 1: Illustration of transformer architecture and typical transformer-specific dropout methods, namely DropKey, DropAttention, and HiddenCut.
Figure 2: Three structural sampling strategies, namely element, column, and span. The grey and blue cells represent masked and remaining entries, respectively. In HiddenCut, rows and columns denote sequence length ($L$) and hidden dimension ($D$); in DropKey and DropAttention, they represent keys ($K$) and queries ($Q$).
Figure 3: Illustration of HiddenKey. It drops columns of attention logits and elements of hidden representations, and adds a bidirectional KL loss to implicitly minimize the gap between training and inference.
Figure 4: Performance of RoBERTa-large with different dropout methods on four NLU datasets, namely RTE, MRPC, SST-2 and STS-B. Markers and line styles differentiate various dropping positions, while the shades of color represent the structural patterns. Pearson correlation is reported for STS-B, while accuracy is utilized for others.
Figure 5: Evaluation accuracy of LoRA with respect to the rank on RTE dataset.
...and 1 more figure
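The three structural patterns named in the Figure 2 caption (element, column, span) can be sketched as boolean keep-masks over a generic rows-by-columns matrix. This is an illustrative sketch only. The function names are hypothetical, and the span variant assumes a single contiguous run of columns with a fixed length of `round(p * cols)`, which the caption does not specify.

```python
import random

def element_mask(rows, cols, p, rng):
    """Drop each entry independently with probability p (True = keep)."""
    return [[rng.random() >= p for _ in range(cols)] for _ in range(rows)]

def column_mask(rows, cols, p, rng):
    """Drop whole columns: one Bernoulli draw per column, shared by all rows."""
    keep_col = [rng.random() >= p for _ in range(cols)]
    return [list(keep_col) for _ in range(rows)]

def span_mask(rows, cols, p, rng):
    """Drop one contiguous run of columns of length round(p * cols).
    The single-span, fixed-length choice is an assumption for illustration."""
    span = round(p * cols)
    start = rng.randrange(cols - span + 1)
    keep_col = [not (start <= j < start + span) for j in range(cols)]
    return [list(keep_col) for _ in range(rows)]

rng = random.Random(0)
for name, fn in [("element", element_mask), ("column", column_mask), ("span", span_mask)]:
    mask = fn(3, 8, 0.25, rng)
    print(name)
    for row in mask:
        # '#' marks a kept entry, '.' marks a dropped one
        print("".join("#" if keep else "." for keep in row))
```

The element pattern perturbs entries independently, while the column and span patterns remove the same positions in every row, which is what makes them structured.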
