DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language Models

Xuxi Chen, Tianlong Chen, Weizhu Chen, Ahmed Hassan Awadallah, Zhangyang Wang, Yu Cheng

TL;DR

DSEE tackles the dual inefficiencies of fine-tuning and deploying gigantic pre-trained language models by enforcing sparsity in both the weight updates and the final model weights. It combines a sparsity-aware low-rank update with a sparse residual, and derives final sparse masks from the update itself to achieve inference efficiency without extra sparsification costs. The framework supports both unstructured and structured sparsity and shows strong parameter savings (e.g., around 0.5% trainable parameters on BERT) and noticeable FLOP reductions (≈25%) while maintaining competitive downstream performance across BERT, RoBERTa, and GPT-2. This work enables scalable, cost-effective deployment of large PLMs in resource-constrained settings and offers a versatile toolkit for further efficiency-oriented research.

Abstract

Gigantic pre-trained models have become central to natural language processing (NLP), serving as the starting point for fine-tuning towards a range of downstream tasks. However, two pain points persist for this paradigm: (a) as the pre-trained models grow bigger (e.g., 175B parameters for GPT-3), even the fine-tuning process can be time-consuming and computationally expensive; (b) the fine-tuned model has the same size as its starting point by default, which is neither sensible, given its more specialized functionality, nor practical, since many fine-tuned models will be deployed in resource-constrained environments. To address these pain points, we propose a framework for resource- and parameter-efficient fine-tuning that leverages the sparsity prior in both the weight updates and the final model weights. Our proposed framework, dubbed Dually Sparsity-Embedded Efficient Tuning (DSEE), aims to achieve two key objectives: (i) parameter-efficient fine-tuning, by enforcing sparsity-aware low-rank updates on top of the pre-trained weights; and (ii) resource-efficient inference, by encouraging a sparse weight structure in the final fine-tuned model. We leverage sparsity in these two directions by exploiting both unstructured and structured sparse patterns in pre-trained language models via a unified approach. Extensive experiments and in-depth investigations, with diverse network backbones (i.e., BERT, RoBERTa, and GPT-2) on dozens of datasets, consistently demonstrate impressive parameter and inference efficiency while maintaining competitive downstream performance. For instance, DSEE saves about 25% of inference FLOPs while achieving comparable performance, with 0.5% trainable parameters on BERT. Code is available at https://github.com/VITA-Group/DSEE.
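
The two objectives can be illustrated with a small sketch. It is not the authors' implementation: it assumes the fine-tuned weight takes the form `S1 * W0 + U @ V + S2`, where `W0` is the frozen pre-trained weight, `S1` is a 0/1 mask on `W0`, `U` and `V` are the low-rank factors, and `S2` is a sparse residual with a fixed set of non-zero positions. The exact placement of the mask and all function names here are assumptions made for illustration; the abstract does not spell them out.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows (pure Python, no NumPy)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def compose_weight(w0, u, v, s2, s1=None):
    """Return S1 * W0 + U V + S2 (assumed form, see lead-in).

    w0: frozen pre-trained weight, list of rows
    u, v: low-rank factors with shapes (rows x r) and (r x cols)
    s2: sparse residual stored as {(i, j): value}
    s1: optional 0/1 mask applied element-wise to w0
    """
    rows, cols = len(w0), len(w0[0])
    low_rank = matmul(u, v)
    out = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            keep = 1 if s1 is None else s1[i][j]
            out_row.append(keep * w0[i][j] + low_rank[i][j] + s2.get((i, j), 0.0))
        out.append(out_row)
    return out


def trainable_fraction(rows, cols, rank, num_sparse):
    """Share of trainable values (U, V, non-zeros of S2) relative to a dense update."""
    return (rank * (rows + cols) + num_sparse) / (rows * cols)


# Toy 2x2 example with a rank-1 update and one residual entry.
w0 = [[1.0, 2.0], [3.0, 4.0]]
u = [[1.0], [0.0]]
v = [[0.5, 0.0]]
s2 = {(1, 1): -1.0}
s1 = [[1, 0], [1, 1]]  # drops the pre-trained entry at position (0, 1)

merged = compose_weight(w0, u, v, s2, s1)
fraction = trainable_fraction(768, 768, rank=8, num_sparse=64)
```

Only `u`, `v`, and the values of `s2` would be updated during fine-tuning in this sketch, which is why the trainable fraction stays small for a low rank and a small residual. The 768 x 768 shape, rank 8, and 64 residual entries are illustrative values, not settings taken from the paper.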

Paper Structure

This paper contains 32 sections, 2 equations, 2 figures, 14 tables, 2 algorithms.

Figures (2)

  • Figure 1: Overview of our proposal. The sparse masks can have unstructured or structured patterns, which lead to resource efficiency. During fine-tuning, we train only the decomposed matrices $\mathcal{U}$ and $\mathcal{V}$ and the non-zero elements of $\mathcal{S}_2$.
  • Figure 2: Testing performance on SST-2 with different sizes of $\Omega$. We report the average accuracy and the $90\%$ confidence interval of five runs.