Differentially Private Bias-Term Fine-tuning of Foundation Models

Zhiqi Bu; Yu-Xiang Wang; Sheng Zha; George Karypis

TL;DR

The paper tackles the challenge of privately fine-tuning large foundation models by introducing DP-BiTFiT, which privately tunes only bias terms and avoids storing activations, yielding considerable improvements in computation and memory over DP full fine-tuning. By leveraging activation-free forward passes and per-sample bias gradients, DP-BiTFiT remains model-agnostic and parameter-efficient (about 0.1% trainable parameters) while delivering competitive or superior accuracy under DP constraints, across language, generation, and vision tasks. The authors provide a thorough complexity and scalability analysis, demonstrate substantial speedups and memory savings, and open-source their FastDP implementation. The method enables DP fine-tuning on long sequences and high-resolution images that were previously challenging, broadening the practical deployment of privacy-preserving fine-tuning for large models.
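The "activation-free" property mentioned above can be illustrated with a minimal sketch. It assumes a single linear layer s = aW + b and a generic DP-SGD style update (per-sample clipping to a norm bound, then Gaussian noise); it is not taken from the FastDP code base. The function names, the clipping rule, and the hyperparameters are hypothetical choices for illustration. In a full model the clipping norm would be computed over the bias gradients of all layers together, not one layer as here.

```python
import math
import random

def per_sample_bias_grads(output_grads):
    """output_grads[i][t][j] is dL_i/ds for sample i, position t, output unit j.

    For s = a @ W + b, the bias gradient of sample i is the sum of dL_i/ds over
    positions t, so the layer input `a` (the activation) is never needed.
    """
    return [[sum(col) for col in zip(*sample)] for sample in output_grads]

def dp_bias_update(bias, output_grads, clip_norm, noise_multiplier, lr, rng):
    """One illustrative DP step on a bias vector (assumed recipe, see lead-in)."""
    grads = per_sample_bias_grads(output_grads)
    summed = [0.0] * len(bias)
    for g in grads:
        norm = math.sqrt(sum(v * v for v in g))
        # Scale each per-sample gradient so its norm is at most clip_norm.
        scale = min(1.0, clip_norm / (norm + 1e-12))
        for j, v in enumerate(g):
            summed[j] += scale * v
    batch_size = len(grads)
    # Add Gaussian noise with std noise_multiplier * clip_norm, then average.
    noisy = [(s + rng.gauss(0.0, noise_multiplier * clip_norm)) / batch_size
             for s in summed]
    return [b - lr * g for b, g in zip(bias, noisy)]

# Hypothetical usage: 2 samples, 2 output units, toy output gradients.
toy_output_grads = [[[3.0, 4.0]], [[0.3, 0.4], [0.3, 0.4]]]
updated_bias = dp_bias_update([0.0, 0.0], toy_output_grads, clip_norm=1.0,
                              noise_multiplier=1.0, lr=0.1, rng=random.Random(0))
```

Because the per-sample bias gradient depends only on the output gradients, the sketch never touches the layer input, which is the intuition behind the memory savings described in the summary.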

Abstract

We study the problem of differentially private (DP) fine-tuning of large pre-trained models, a recent privacy-preserving approach suitable for solving downstream tasks with sensitive data. Existing work has demonstrated that high accuracy is possible under strong privacy constraints, yet achieving it requires significant computational overhead or modifications to the network architecture. We propose differentially private bias-term fine-tuning (DP-BiTFiT), which matches the state-of-the-art accuracy for DP algorithms and the efficiency of the standard BiTFiT. DP-BiTFiT is model agnostic (it does not modify the network architecture), parameter efficient (it trains only about 0.1% of the parameters), and computation efficient (it almost removes the overhead caused by DP, in both time and space complexity). On a wide range of tasks, DP-BiTFiT is 2~30X faster and uses 2~8X less memory than DP full fine-tuning, and is even faster than standard full fine-tuning. This efficiency enables us to conduct DP fine-tuning on language and vision tasks with long-sequence texts and high-resolution images, which were computationally difficult using existing methods. We open-source our code at FastDP (https://github.com/awslabs/fast-differential-privacy).
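The "about 0.1% of the parameters" figure can be sanity-checked with a rough count for one transformer block. The sketch below is a back-of-the-envelope estimate, not the paper's own accounting: the layer layout (four attention projections, a two-layer feed-forward network, two LayerNorms) and the sizes used in the example are assumptions, and embeddings and task heads are ignored.

```python
def transformer_block_counts(d_model, d_ff):
    """Return (non-bias parameters, bias parameters) for one assumed block."""
    # Q, K, V and output projections, plus the two feed-forward matrices.
    weights = 4 * d_model * d_model + 2 * d_model * d_ff
    # Two LayerNorm scale vectors (counted as non-bias parameters).
    weights += 2 * d_model
    # Biases: four attention projections, two feed-forward layers, two LayerNorms.
    biases = 4 * d_model + (d_ff + d_model) + 2 * d_model
    return weights, biases

def bias_fraction(d_model, d_ff):
    """Fraction of the block's parameters that are bias terms."""
    weights, biases = transformer_block_counts(d_model, d_ff)
    return biases / (weights + biases)

# Hypothetical sizes resembling a large BERT-style encoder block.
fraction = bias_fraction(d_model=1024, d_ff=4096)
print(f"bias share of block parameters: {fraction:.4%}")
```

Under these assumptions the bias share comes out just below 0.1%, because the number of weights grows quadratically with the hidden size while the number of biases grows only linearly.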

Paper Structure (30 sections, 10 equations, 5 figures, 17 tables, 1 algorithm)

Introduction
Preliminaries
Differentially private Bias-Term Fine-Tuning
Parameter efficiency
Complexity of weight and bias training
Scalability of DP algorithms
Efficiency vs. feature dimension
Efficiency vs. model size
Applicability of DP-BiTFiT
Experiments
Text classification
Natural Language Generation
Image classification
Discussion
Detailed analysis
...and 15 more sections

Figures (5)

Figure 1: Performance of different fine-tuning methods on the MNLI dataset with RoBERTa-large. Among DP methods, DP-BiTFiT is one of the most accurate (marginally below DP LoRA), one of the fastest (slower only than DP Adapter), and the most memory efficient (outperforming the others substantially, by $3\times$).
Figure 2: Back-propagation for DP (red and black) and non-DP (black) algorithms. Note that the bias gradient uses a much simpler computation graph than the weight gradient, rendering DP-BiTFiT easy to implement and efficient to compute. Left: full fine-tuning with GhostClip (ghost clipping; goodfellow2015efficient, li2021large, bu2022scalable). Upper right: full fine-tuning with Opacus (opacus). Lower right: DP-BiTFiT.
Figure 3: Memory and speed by different fine-tuning methods. Top two: SST2 dataset (sequence length $T$; MixGhostClip is equivalent to GhostClip for this small $T$), RoBERTa-base and batch size 20. Bottom two: 50000 images of $\sqrt{T}\times\sqrt{T}$ pixels, ResNet50 and batch size 200.
Figure 4: Maximum throughput and batch size by different fine-tuning methods. Each model is represented by one column, which sorts the model size in decreasing order from left to right. Top two: E2E dataset with GPT2-small/medium/large (MixGhostClip is equivalent to GhostClip for this small $T$). Bottom two: 50000 images of $512\times 512$ pixels with ResNet 50/101/152.
Figure 5: Accuracy of DP ViT-large on CIFAR100.

Theorems & Definitions (2)

Definition 2.1: dwork2006calibrating
Remark 4.1
