MEFT: Memory-Efficient Fine-Tuning through Sparse Adapter

Jitai Hao; WeiWei Sun; Xin Xin; Qi Meng; Zhumin Chen; Pengjie Ren; Zhaochun Ren

TL;DR

MEFT tackles the memory bottlenecks of fine-tuning large language models on knowledge-intensive tasks by offloading large adapters to CPU memory and using activation sparsity to transfer only the highly relevant neurons to the GPU. A Mixture-of-Experts–inspired Key-Experts mechanism further reduces CPU computation and CPU-GPU communication, lowering the nominal complexity of retrieving relevant parameters from $O(dNM)$ to $O(dN\sqrt{M})$. Empirical results on LLaMA-7B and Mistral-7B across Natural Questions, SQuAD, ToolBench, and GSM8K show that MEFT achieves state-of-the-art or competitive performance within 24 GB of GPU memory, cutting memory usage roughly in half while sustaining training efficiency. These advances enable effective knowledge adaptation of large models on limited hardware, with broad implications for resource-constrained fine-tuning workflows.

Abstract

Parameter-Efficient Fine-Tuning (PEFT) facilitates the fine-tuning of Large Language Models (LLMs) under limited resources. However, the performance of PEFT on complex, knowledge-intensive tasks is limited by constrained model capacity, which stems from the small number of additional trainable parameters. To overcome this limitation, we introduce a novel mechanism that fine-tunes LLMs with larger adapters while remaining memory-efficient. This is achieved by leveraging the inherent activation sparsity in the Feed-Forward Networks (FFNs) of LLMs and by exploiting the larger capacity of Central Processing Unit (CPU) memory compared with that of the Graphics Processing Unit (GPU). We store and update the parameters of the larger adapters on the CPU. Moreover, we employ a Mixture of Experts (MoE)-like architecture to avoid unnecessary CPU computation and to reduce the communication volume between the GPU and CPU, which is particularly beneficial given the limited bandwidth of PCI Express (PCIe). Our method achieves fine-tuning results comparable to those obtained with larger memory capacities, even under more limited resources such as a single GPU with 24GB of memory, with an acceptable loss in training efficiency. Our code is available at https://github.com/CURRENTF/MEFT.
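The sparse-activation idea in the abstract can be illustrated with a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function name `sparse_adapter_forward`, the two-matrix adapter layout (`W_key`, `W_value`), the ReLU activation, and the top-k selection rule are all hypothetical choices made for this example, and everything runs on the CPU here rather than being split across CPU and GPU.

```python
import numpy as np

def sparse_adapter_forward(h, W_key, W_value, k):
    """Adapter forward pass that uses only the k most activated neurons.

    h:       (d,)   hidden state entering the adapter
    W_key:   (M, d) one row per adapter neuron (assumed layout)
    W_value: (M, d) output row for each adapter neuron (assumed layout)
    """
    # Score every neuron. In an offloading setup this step would sit with
    # the full parameter store (assumption: CPU memory).
    acts = np.maximum(W_key @ h, 0.0)          # ReLU activations, shape (M,)

    # Keep the indices of the k largest activations (order is arbitrary).
    k = min(k, acts.shape[0])
    top_idx = np.argpartition(acts, -k)[-k:]

    # Only these k rows would need to be moved to the accelerator.
    out = acts[top_idx] @ W_value[top_idx]     # shape (d,)
    return out, top_idx

rng = np.random.default_rng(0)
d, M, k = 16, 4096, 64
h = rng.standard_normal(d)
W_key = rng.standard_normal((M, d))
W_value = rng.standard_normal((M, d))

out, idx = sparse_adapter_forward(h, W_key, W_value, k)
dense = np.maximum(W_key @ h, 0.0) @ W_value   # reference using all M neurons
```

With `k` equal to `M` the sparse pass reproduces the dense output exactly; with smaller `k`, only `k` of the `M` parameter rows take part in the computation, which is the property that makes keeping the full adapter in CPU memory plausible.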

Paper Structure (41 sections, 7 equations, 9 figures, 3 tables, 1 algorithm)

Introduction
Preliminary
Parallel Adapter
Sparsity in Parallel Adapter
Method
Sparse Activation
Key-Experts Mechanism
Efficiency Analysis
Communication Volume
Computational Complexity
Empirical Results
Experimental Setup
Datasets
Metrics
Implementation Details
...and 26 more sections

Figures (9)

Figure 1: Accuracy of different PEFT methods on Natural Questions as the number of trainable parameters increases. The orange part denotes that the model has reached its fine-tuning limit on a 24GB GPU. The blue part shows that performance decreases when trainable parameters are limited.
Figure 2: Sparsity analysis on a Parallel Adapter with $4096$ neurons. The neurons are sorted by activation value. Only a subset of neurons (left part) exhibit high activation values, while the majority of neurons are not activated and do not contribute to the model's predictions.
Figure 3: Overview of our MEFT. The dotted line divides the parameters into two parts, which would be placed on the GPU (left part) and CPU (right part), respectively. Most of the trainable parameters will be allocated to the CPU. During the forward propagation stage, the output of the attention block will be transferred to the CPU to efficiently retrieve neurons highly related to the current context using a MoE-like structure, after which the activated neurons will be transferred to the GPU. During the backward propagation, we transfer the gradients to the CPU and update parameters on the CPU. The above block shown for one Transformer layer is repeated across all the layers.
Figure 4: Performance comparison between MEFT w/o KE and MEFT.
Figure 5: Ablation study on latency (ms) per batch relative to Parallel Adapter.
...and 4 more figures
