PARA: Parameter-Efficient Fine-tuning with Prompt Aware Representation Adjustment

Zequan Liu, Yi Zhao, Ming Tan, Wei Zhu, Aaron Xuxiang Tian

TL;DR

PARA (Prompt Aware Representation Adjustment) is a parameter-efficient fine-tuning method for large language models. It embeds a lightweight vector generator in every Transformer layer; the generator produces prompt-conditioned vectors that modulate the Q, V, and FFN pathways while keeping inference KV-cache-friendly. With approximately 8.9M additional tunable parameters on a 7B backbone, PARA achieves superior or competitive performance across multiple benchmarks (SQuAD, BoolQ, COPA, HSM10K, Q2SQL) compared to LoRA-based and other PEFT baselines, while reducing latency relative to LoRA in multi-tenant settings. The method is robust across backbones (7B, 13B, Gemma 2B) and is practical for industrial MaaS scenarios, though it has not been evaluated on ultra-large models or on a broader range of NLP tasks.

Abstract

Among parameter-efficient fine-tuning (PEFT) methods, options such as LoRA are available, yet industry still needs a PEFT approach that excels in both efficiency and performance for single-backbone multi-tenant applications. This paper introduces a new and straightforward PEFT technique, termed Prompt Aware Representation Adjustment (PARA). The core of our proposal is to integrate a lightweight vector generator within each Transformer layer. This generator produces vectors that are responsive to the input prompt and uses them to adjust the hidden representations. Experiments across diverse tasks yield two main results. First, PARA outperforms current PEFT baselines with a similar number of tunable parameters. Second, it is more efficient than LoRA in the single-backbone multi-tenant scenario, which highlights its potential for industrial adoption.

Paper Structure

This paper contains 15 sections, 6 equations, 1 figure, 5 tables.

Figures (1)

  • Figure 1: Schematic of the PARA approach. Left: the vector generator consists of a pooler, a down-projection layer, an activation function, and an up-projection layer. It takes the hidden states of the prompt as input and outputs adjusting vectors. Right: these adjusting vectors scale the Query (Q) and Value (V) hidden states within the MHSA (Multi-Head Self-Attention) module, as well as the Up (U) hidden states within the feed-forward network.
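
The pipeline in the Figure 1 caption (pooler, down-projection, activation, up-projection, then element-wise scaling) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The names `VectorGenerator`, `mean_pool`, and `scale_hidden` are hypothetical, and three details are assumptions the text above does not specify: mean pooling as the pooler, a tanh-approximated GELU as the activation, and an initialization (zero up-projection weights, unit bias) that makes the generated vector start at all ones so the scaling is initially an identity.

```python
import math
import random


def mean_pool(hidden_states):
    """Average prompt token states (seq_len x d) into one vector of size d.

    Assumption: the caption only says "pooler"; mean pooling is one choice.
    """
    seq_len = len(hidden_states)
    d = len(hidden_states[0])
    return [sum(tok[j] for tok in hidden_states) / seq_len for j in range(d)]


def linear(weight, bias, x):
    """Compute W x + b, with weight stored as out_dim rows of in_dim values."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def gelu(x):
    """Tanh approximation of GELU (assumed activation, not from the source)."""
    return 0.5 * x * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


class VectorGenerator:
    """Pooler -> down-projection -> activation -> up-projection."""

    def __init__(self, d_model, d_bottleneck, d_out, seed=0):
        rng = random.Random(seed)
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
                       for _ in range(d_bottleneck)]
        self.b_down = [0.0] * d_bottleneck
        # Assumed init: zero weights and unit bias, so the output is all ones
        # before any training and the backbone's behavior is unchanged.
        self.w_up = [[0.0] * d_bottleneck for _ in range(d_out)]
        self.b_up = [1.0] * d_out

    def __call__(self, prompt_hidden):
        pooled = mean_pool(prompt_hidden)
        bottleneck = [gelu(v)
                      for v in linear(self.w_down, self.b_down, pooled)]
        return linear(self.w_up, self.b_up, bottleneck)


def scale_hidden(hidden_states, vector):
    """Element-wise scale every token's hidden state by the adjusting vector."""
    return [[h * v for h, v in zip(tok, vector)] for tok in hidden_states]


# Toy usage: 2 prompt tokens, model width 4, bottleneck width 2.
prompt_hidden = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
generator = VectorGenerator(d_model=4, d_bottleneck=2, d_out=4)
adjust_q = generator(prompt_hidden)  # vector that would scale Q states
scaled_q = scale_hidden(prompt_hidden, adjust_q)
```

In a real layer, separate generated vectors would scale the Q, V, and Up hidden states, whose widths may differ from the model width; the toy example reuses the prompt states as stand-in Q states only to keep the sketch short. Because each vector depends only on the prompt, it could be computed once per request and reused during decoding, which is consistent with the KV-cache-friendly claim in the TL;DR.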