GAST: Gradient-aligned Sparse Tuning of Large Language Models with Data-layer Selection

Kai Yao; Zhenghan Song; Kaixin Wu; Mingjie Zhong; Danzhao Cheng; Zhaorui Tan; Yixin Ji; Penglei Gao

TL;DR

GAST performs selective fine-tuning along both the data and layer dimensions within a single unified optimization strategy, rather than restricting selection to one dimension as earlier sparse-tuning approaches do.

Abstract

Parameter-Efficient Fine-Tuning (PEFT) has become a key strategy for adapting large language models, with recent advances in sparse tuning reducing overhead by selectively updating key parameters or training on subsets of data. Existing approaches generally follow one of two distinct paradigms: layer-selective methods, which fine-tune only critical layers to reduce computational load, and data-selective methods, which select effective training subsets to improve training. However, current methods typically overlook the fact that different data points contribute to different model layers to varying degrees, and they often discard potentially valuable information from data perceived as low quality. To address these limitations, we propose Gradient-aligned Sparse Tuning (GAST), a method that performs selective fine-tuning along both the data and layer dimensions as integral components of a unified optimization strategy. GAST targets information redundancy with a layer-sparse strategy that adaptively selects the most impactful data points for each layer, which covers more of the selection space than approaches restricted to a single dimension. Experiments demonstrate that GAST consistently outperforms baseline methods, establishing a promising direction for future research in PEFT strategies.

Paper Structure (30 sections, 2 theorems, 19 equations, 6 figures, 9 tables, 1 algorithm)

Introduction
Related Work
Parameter-efficient Fine-tuning
Layer-wise Sparse Tuning
Data-wise Sparse Tuning
Method
Theoretical Motivation
Gradient-aligned Sparse Tuning
Experiments
Experimental Setup
Models.
Datasets.
Implementation Details.
Baselines
Comparison with Adaptive Methods
...and 15 more sections

Key Result

Lemma 1

Let $\ell(\Delta)$ be an $L$-smooth objective with respect to $\Delta$. At iteration $t$, let $\Delta_t$ denote the current parameters and let $g_t$ be a stochastic gradient estimator satisfying […]. With step size $\eta_t > 0$, the conditional expectation of the loss satisfies […]. Here $C>0$ is the Lipschitz constant. In particular, for fixed $\eta_t$ and bounded $\mathbb{E}[\|g_t\|^{2} \mid \Delta_t]$, […]
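
For orientation, the textbook descent bound for a smooth objective has the form below. This is a sketch under stated assumptions, not necessarily the exact statement of Lemma 1: it assumes the update rule $\Delta_{t+1} = \Delta_t - \eta_t g_t$ (not given in the text above) and takes $C$ to be the Lipschitz constant of $\nabla \ell$.

$$
\mathbb{E}\big[\ell(\Delta_{t+1}) \mid \Delta_t\big] \;\le\; \ell(\Delta_t) \;-\; \eta_t \big\langle \nabla \ell(\Delta_t),\, \mathbb{E}[g_t \mid \Delta_t] \big\rangle \;+\; \frac{C\,\eta_t^{2}}{2}\, \mathbb{E}\big[\|g_t\|^{2} \mid \Delta_t\big]
$$

Under these assumptions, for fixed $\eta_t$ and bounded $\mathbb{E}[\|g_t\|^{2} \mid \Delta_t]$, the only term that can drive the expected loss down is the inner product between the true gradient and the expected estimator, so a better-aligned estimator yields a larger guaranteed decrease.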

Figures (6)

Figure 1: Differences among (a) layer-selective methods, (b) data-selective methods, and (c) our method. Layer-selective methods choose a subset of layers to update using all mini-batch data. Data-selective methods use part of the mini-batch data to train all layers. Our GAST selects a different subset of data for each layer.
Figure 2: Overview of our proposed Gradient-aligned Sparse Tuning (GAST). During training, every mini-batch exhibits gradient conflicts both among data samples and across model layers. To mitigate these conflicts, GAST uses the gradient of the support set to decide which individual samples should be used to update each layer. This data-layer selection reduces gradient interference and thereby improves both convergence speed and generalization performance.
Figure 3: Comparison of model convergence with loss curve.
Figure 4: Impact of data-layer sparsity in GAST.
Figure 5: (a) Visualization of the sampling probability of mini-batch data points across layers on two different iterations. A deeper red color indicates a higher probability. (b) Distribution of the number of layers each data point is trained on within a mini-batch.
...and 1 more figures
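
The data-layer selection described in the captions of Figures 1 and 2 can be illustrated with a small NumPy sketch. It is not the authors' implementation: the cosine-similarity score, the deterministic top-k rule, and the names `select_samples_per_layer`, `masked_layer_gradients`, and `keep_ratio` are all assumptions made for illustration (Figure 5 suggests the actual method works with sampling probabilities instead).

```python
import numpy as np

def select_samples_per_layer(sample_grads, support_grads, keep_ratio=0.5):
    """Build a boolean mask of shape (num_layers, batch_size).

    sample_grads:  list with one array per layer, each (batch_size, dim_l),
                   holding flattened per-sample gradients for that layer.
    support_grads: list with one array per layer, each (dim_l,),
                   holding the support-set gradient for that layer.
    keep_ratio:    hypothetical knob: fraction of the batch kept per layer.
    """
    num_layers = len(sample_grads)
    batch_size = sample_grads[0].shape[0]
    k = max(1, int(round(keep_ratio * batch_size)))
    mask = np.zeros((num_layers, batch_size), dtype=bool)
    for layer, (g, s) in enumerate(zip(sample_grads, support_grads)):
        # Assumed alignment score: cosine similarity between each sample's
        # gradient and the support-set gradient of the same layer.
        denom = np.linalg.norm(g, axis=1) * np.linalg.norm(s) + 1e-12
        scores = (g @ s) / denom
        # Assumed selection rule: keep the k best-aligned samples.
        top = np.argsort(-scores)[:k]
        mask[layer, top] = True
    return mask

def masked_layer_gradients(sample_grads, mask):
    """Average, per layer, only the gradients of the selected samples."""
    return [g[m].mean(axis=0) for g, m in zip(sample_grads, mask)]

# Toy usage: 4 layers, batch of 8 samples, 16 parameters per layer.
rng = np.random.default_rng(0)
sample_grads = [rng.normal(size=(8, 16)) for _ in range(4)]
support_grads = [rng.normal(size=16) for _ in range(4)]
mask = select_samples_per_layer(sample_grads, support_grads, keep_ratio=0.25)
updates = masked_layer_gradients(sample_grads, mask)
```

Each row of `mask` corresponds to one layer, so different layers can keep different samples, which is the distinction Figure 1(c) draws against selecting layers only or data only.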

Theorems & Definitions (2)

Lemma 1: L-Smoothness
Theorem 1: Total Differential
