PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark

Robert Belanec, Branislav Pecher, Ivan Srba, Maria Bielikova

TL;DR

PEFT-Bench delivers a unified, open benchmark for parameter-efficient fine-tuning of autoregressive LLMs, standardizing datasets, methods, and metrics across 27 NLP tasks. It introduces the PSCP metric to jointly capture performance with trainable parameters, inference FLOPs, and training memory, enabling fair comparisons. The framework demonstrates that methods like LoRA and LNTuning offer different tradeoffs between accuracy and efficiency, while soft-prompting methods tend to be harder to train and less stable. This work lays the foundation for reproducible PEFT evaluation and future extensions, including a web interface and multi-task analyses, to guide practical deployments under resource constraints.

Abstract

Despite the state-of-the-art performance of Large Language Models (LLMs) achieved on many tasks, their massive scale often leads to high computational and environmental costs, limiting their accessibility. Parameter-efficient fine-tuning (PEFT) methods address this challenge by reducing the number of trainable parameters while maintaining strong downstream performance. Despite the increased development in PEFT methods, current evaluations remain limited (in terms of evaluated models and datasets) and difficult to reproduce. To bridge this gap, we introduce PEFT-Bench, a unified end-to-end benchmark for evaluating diverse PEFT methods on autoregressive LLMs. We demonstrate its usage across 27 NLP datasets and 6 PEFT methods. To account for different PEFT training and inference factors, we also introduce the PEFT Soft Score Penalties (PSCP) metric, which takes trainable parameters, inference speed, and training memory usage into account.

Paper Structure

This paper contains 27 sections, 3 equations, 4 figures, 11 tables.

Figures (4)

  • Figure 1: Diagram describing the methodology of PEFT-Bench. Blue components represent our contributions. We design PEFT-Factory, a framework built on the LLaMa-Factory backbone [zheng2024llamafactory] that implements off-the-shelf methods from the HuggingFace PEFT library and provides an easy-to-use interface for new PEFT methods. Using these methods, we train LLaMa on selected datasets, which we have also included in the backbone. After training, we evaluate and compute the metrics for each model, method, and dataset combination.
  • Figure 2: A diagram showing the overview and categorizations of datasets used in PEFT-Bench, totaling 27 datasets categorized into 3 main groups -- NLU and Reasoning, Math, and Code Generation.
  • Figure 3: We evaluate methods from additive, reparametrized, and selective PEFT categories. The diagram shows the categorization of each method.
  • Figure 4: Bar chart showing the stability of different PEFT methods on 4 low-resource datasets. IA$^3$ achieves the lowest standard deviation on average across all datasets. LoRA is less stable than the other methods on the CB dataset, while also achieving a better average score. P-Tuning generally achieves the worst results in terms of both performance and stability. For numerical results, see Table \ref{tab:app:results-stability} in Appendix \ref{app:additional_results}.
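
The stability comparison summarized in Figure 4 can be sketched as follows: for each method, compute the standard deviation of scores over repeated runs on each dataset, then average those deviations across datasets. This is only a minimal illustration, not the PEFT-Bench implementation. The method names, dataset names, and scores are hypothetical placeholders, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev

# Hypothetical placeholder scores (not results from the paper):
# method -> dataset -> scores from repeated training runs.
runs = {
    "method_a": {"dataset_1": [70.0, 72.0, 74.0], "dataset_2": [60.0, 60.0, 60.0]},
    "method_b": {"dataset_1": [80.0, 70.0, 90.0], "dataset_2": [65.0, 55.0, 75.0]},
}


def stability_summary(runs):
    """Return {method: (average score, average per-dataset standard deviation)}."""
    summary = {}
    for method, per_dataset in runs.items():
        # Average of the per-dataset mean scores.
        avg_score = mean(mean(scores) for scores in per_dataset.values())
        # Average of the per-dataset sample standard deviations (assumed estimator).
        avg_std = mean(stdev(scores) for scores in per_dataset.values())
        summary[method] = (avg_score, avg_std)
    return summary


summary = stability_summary(runs)
# A lower average standard deviation is read as a more stable method.
most_stable = min(summary, key=lambda m: summary[m][1])

for method, (avg_score, avg_std) in summary.items():
    print(f"{method}: mean score = {avg_score:.2f}, mean std = {avg_std:.2f}")
print("most stable:", most_stable)
```

With these placeholder values, the method with the higher average score is also the less stable one, which is the kind of tradeoff the caption describes for LoRA on CB.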