PaCA: Partial Connection Adaptation for Efficient Fine-Tuning

Sunghyeon Woo, Sol Namkung, Sunwoo Lee, Inho Jeong, Beomseok Kim, Dongsuk Jeon

TL;DR

PaCA tackles the training-time and activation-memory bottlenecks of parameter-efficient fine-tuning by updating randomly selected partial connections within pretrained weights, avoiding the overhead of adapter layers. The approach integrates into standard forward/backward passes, and a convergence guarantee based on a Lipschitz-continuous gradient supports its theoretical soundness. Empirically, PaCA trains faster and uses less memory than LoRA-style methods at comparable accuracy on MMLU and MT-Bench, and it remains compatible with 4-bit quantization (QPaCA), enabling fine-tuning of very large models such as LLaMA3.1-70B. The results show improved throughput and longer usable sequence lengths on both GPU and HPU backends (NVIDIA A100 and Intel Gaudi2), and the method generalizes to vision transformers and CNNs, making PaCA a versatile alternative to existing PEFT schemes for scalable fine-tuning.

Abstract

Prior parameter-efficient fine-tuning (PEFT) algorithms reduce memory usage and computational costs of fine-tuning large neural network models by training only a few additional adapter parameters, rather than the entire model. However, the reduction in computational costs due to PEFT does not necessarily translate to a reduction in training time; although the computational costs of the adapter layers are much smaller than those of the pretrained layers, it is well known that those two types of layers are processed sequentially on GPUs, resulting in significant latency overhead. LoRA and its variants merge low-rank adapter matrices with pretrained weights during inference to avoid latency overhead, but during training, the pretrained weights remain frozen while the adapter matrices are continuously updated, preventing such merging. To mitigate this issue, we propose Partial Connection Adaptation (PaCA), which fine-tunes randomly selected partial connections within the pretrained weights instead of introducing adapter layers in the model. PaCA not only enhances training speed by eliminating the time overhead due to the sequential processing of the adapter and pretrained layers but also reduces activation memory since only partial activations, rather than full activations, need to be stored for gradient computation. Compared to LoRA, PaCA reduces training time by 22% and total memory usage by 16%, while maintaining comparable accuracy across various fine-tuning scenarios, such as fine-tuning on the MMLU dataset and instruction tuning on the Oasst1 dataset. PaCA can also be combined with quantization, enabling the fine-tuning of large models such as LLaMA3.1-70B. In addition, PaCA enables training with 23% longer sequences and improves throughput by 16% on both NVIDIA A100 GPU and Intel Gaudi2 HPU compared to LoRA. The code is available at https://github.com/WooSunghyeon/paca.
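
The idea of updating partial connections and storing only partial activations can be illustrated with a small, dependency-free sketch. It is not the authors' implementation (that lives in the repository linked above). It assumes that the "partial connections" are whole input columns of a weight matrix chosen uniformly at random, and the names `PartialConnectionLinear`, `num_trainable_cols`, `saved_x`, and `matvec` are hypothetical.

```python
import random


def matvec(W, x):
    """Dense forward pass y = W @ x for a list-of-rows matrix."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


class PartialConnectionLinear:
    """Hypothetical layer: only a random subset of W's input columns is trained."""

    def __init__(self, W, num_trainable_cols, seed=0):
        self.W = [row[:] for row in W]  # copy of the pretrained weights
        rng = random.Random(seed)
        d_in = len(W[0])
        # Assumption: partial connections = columns sampled uniformly at random.
        self.idx = sorted(rng.sample(range(d_in), num_trainable_cols))
        self.saved_x = None

    def forward(self, x):
        # Keep only the activations that feed the trainable columns.
        self.saved_x = [x[j] for j in self.idx]
        # One dense product with the pretrained matrix, no separate adapter branch.
        return matvec(self.W, x)

    def backward(self, grad_y):
        # dL/dW[i][j] = grad_y[i] * x[j], needed only for the selected columns j.
        grad_cols = [[g * xj for xj in self.saved_x] for g in grad_y]
        # Gradient passed to earlier layers: W^T @ grad_y.
        rows, cols = len(self.W), len(self.W[0])
        grad_x = [sum(self.W[i][j] * grad_y[i] for i in range(rows))
                  for j in range(cols)]
        return grad_cols, grad_x

    def step(self, grad_cols, lr):
        # Plain SGD update written directly into the selected columns of W.
        for i, row in enumerate(grad_cols):
            for k, j in enumerate(self.idx):
                self.W[i][j] -= lr * row[k]


# Toy usage with a 2x4 weight matrix and 2 trainable columns.
W0 = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
layer = PartialConnectionLinear(W0, num_trainable_cols=2, seed=0)
x = [1.0, 2.0, 3.0, 4.0]
y = layer.forward(x)
grad_cols, grad_x = layer.backward([1.0, -1.0])
layer.step(grad_cols, lr=0.1)
```

In this toy setting the layer stores 2 of the 4 input activations for the backward pass, and the update lands inside the pretrained matrix itself, so there is no second set of adapter weights to process during training or to merge afterwards.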

Paper Structure

This paper contains 18 sections, 2 theorems, 10 equations, 3 figures, 13 tables.

Key Result

Theorem 1

If the gradient of the loss function $f(\textbf{W}, \textbf{X})$ is Lipschitz continuous and only partial connections are updated, then

Figures (3)

  • Figure 1: Overview of the Partial Connection Adaptation (PaCA) algorithm.
  • Figure 2: The number of operations (TFLOPs) and training time (ms) per iteration when training LLaMA3-8B with full fine-tuning (Full-FT) and LoRA.
  • Figure 3: Training throughput (sentences/s) on a single NVIDIA A100 GPU and INTEL Gaudi2 HPU when fine-tuning LLaMA3-8B with a sequence length of 512.

Theorems & Definitions (3)

  • Theorem 1
  • Theorem 1
  • proof