PEDRO: Parameter-Efficient Fine-tuning with Prompt DEpenDent Representation MOdification

Tianfang Xie, Tianjing Li, Wei Zhu, Wei Han, Yi Zhao

TL;DR

PEDRO targets parameter-efficient fine-tuning (PEFT) of large language models in single-backbone multi-tenant deployments. Each Transformer layer receives a lightweight Vector Generator (VG) that produces prompt-conditioned adjustment vectors, which modulate the Query (Q), Value (V), and Up (U) hidden states via $Q' = l_q \odot Q$, $V' = l_v \odot V$, and $U' = l_u \odot U$. The vectors are reused together with the KV-cache, which keeps latency low. The VG applies a pooling step followed by down- and up-projections to generate $l_q, l_v, l_u$, and uses a learnable rational activation function, trained with bi-level optimization, to adapt activations across layers and prompts. Empirical results across GLUE/SQuAD, MT-Bench, MMLU, BBH, Alpaca, and other benchmarks show PEDRO consistently outperforming strong PEFT baselines at comparable parameter budgets and achieving faster inference than LoRA in multi-tenant settings. This supports PEDRO's practical relevance for industrial deployment of LLMs, enabling efficient, prompt-aware fine-tuning without sacrificing performance.
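
The pipeline summarized above (pool the prompt, down-project, activate, up-project, then scale element-wise) can be illustrated with a minimal NumPy sketch. Several details here are assumptions that the summary does not state: mean pooling over prompt tokens, `tanh` as a stand-in for the learnable rational activation, and a zero-initialized up-projection with a bias of ones so that the model starts out unchanged. The names `vector_generator` and `modulate` are hypothetical.

```python
import numpy as np

def vector_generator(prompt_hidden, w_down, w_up, bias, activation=np.tanh):
    """Map prompt hidden states (prompt_len, d_model) to one adjustment vector.

    Assumptions of this sketch: mean pooling, and tanh standing in for the
    learnable rational activation mentioned in the summary.
    """
    pooled = prompt_hidden.mean(axis=0)        # (d_model,)
    bottleneck = activation(pooled @ w_down)   # (r,)
    return bottleneck @ w_up + bias            # (d_out,)

def modulate(hidden, l):
    """Element-wise scaling, e.g. Q' = l_q * Q, broadcast over tokens."""
    return hidden * l

rng = np.random.default_rng(0)
d_model, r, prompt_len = 8, 2, 5               # toy sizes, chosen for illustration

prompt_hidden = rng.normal(size=(prompt_len, d_model))
w_down = rng.normal(size=(d_model, r))
w_up = np.zeros((r, d_model))                  # assumed zero init
bias = np.ones(d_model)                        # assumed: vector starts at all ones

l_q = vector_generator(prompt_hidden, w_down, w_up, bias)
Q = rng.normal(size=(prompt_len, d_model))
Q_mod = modulate(Q, l_q)                       # equals Q under this initialization
```

With this initialization the adjustment vector is all ones, so `Q_mod` equals `Q` until training moves `w_up` away from zero. The same two functions would be applied to V and U with their own generated vectors.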

Abstract

Due to their substantial sizes, large language models (LLMs) are typically deployed within a single-backbone multi-tenant framework. In this setup, a single instance of an LLM backbone must serve multiple users or tasks through various parameter-efficient fine-tuning (PEFT) modules. Despite the availability of numerous effective PEFT techniques such as LoRA, there remains a need for a PEFT approach that achieves both high efficiency during inference and competitive performance on downstream tasks. In this research, we introduce a new and straightforward PEFT methodology named Prompt DEpenDent Representation MOdification (PEDRO). The proposed method integrates a lightweight vector generator into each Transformer layer, which generates vectors conditioned on the input prompt. These vectors then modify the hidden representations created by the LLM through element-wise multiplication, thereby influencing the semantic output and generated content of the model. Extensive experimentation across a variety of tasks indicates that: (a) PEDRO surpasses recent PEFT baselines when using a similar number of tunable parameters. (b) Under the single-backbone multi-tenant deployment model, PEDRO exhibits superior efficiency compared to LoRA, indicating significant industrial potential.
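
To give a rough sense of why element-wise scaling can be cheaper than LoRA at decoding time, the following back-of-the-envelope sketch counts the extra multiply-accumulate operations per generated token for one adapted projection. It assumes LoRA weights are kept unmerged (as is usual when one backbone serves many tenants) and that the adjustment vector has already been generated from the prompt. The hidden size and rank are hypothetical, and the count ignores the one-time cost of running the vector generator as well as memory traffic.

```python
def lora_extra_macs_per_token(d_in, d_out, rank):
    """Unmerged LoRA adds x @ A (d_in * rank) and then @ B (rank * d_out)."""
    return d_in * rank + rank * d_out

def scaling_extra_macs_per_token(d_out):
    """A cached adjustment vector adds one multiply per output dimension."""
    return d_out

d, r = 4096, 8                                  # hypothetical sizes
lora_cost = lora_extra_macs_per_token(d, d, r)
scaling_cost = scaling_extra_macs_per_token(d)
ratio = lora_cost / scaling_cost
```

Under these assumptions the per-token overhead of the scaling path is smaller by a factor of `2 * rank`. Actual latency depends on the implementation and hardware.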

Paper Structure (14 sections, 8 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Schematic illustration of our PEDRO method. Left: The vector generator consists of a pooler, a down-projection, an activation function, and an up-projection. The vector generator takes the prompt's hidden states as input and outputs the adjusting vectors. Right: The adjusting vectors multiply the Query (Q) and Value (V) hidden states in the MHSA module and the Up (U) hidden states in the feed-forward module.
  • Figure 2: The learned activation functions for the vector generators at different Transformer layers.
  • Figure 3: Performance under different tunable parameter budgets. The $x$-axis represents the number of tunable parameters, and the $y$-axis represents the performance score.
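
Figure 2 refers to activation functions that are learned per layer. As a minimal illustration of what a learnable rational activation can look like, the sketch below evaluates a ratio of two polynomials whose coefficients would be the trainable parameters. The "safe" denominator form $1 + |\cdot|$, the polynomial degrees, and the name `rational_activation` are assumptions of this sketch. The exact parameterization used in the paper is not given in this section.

```python
import numpy as np

def rational_activation(x, a, b):
    """Evaluate R(x) = P(x) / Q(x) element-wise.

    P(x) = a[0] + a[1] x + a[2] x^2 + ...
    Q(x) = 1 + |b[0] x + b[1] x^2 + ...|   (assumed form, never zero)
    """
    x = np.asarray(x, dtype=float)
    numerator = sum(a_j * x**j for j, a_j in enumerate(a))
    denominator = 1.0 + np.abs(sum(b_k * x**(k + 1) for k, b_k in enumerate(b)))
    return numerator / denominator

x = np.array([1.0, -3.0])

# With a = [0, 1] and b = [0, 0] the function reduces to the identity.
identity_out = rational_activation(x, a=[0.0, 1.0], b=[0.0, 0.0])

# With a = [0, 1] and b = [1] it becomes softsign: x / (1 + |x|).
softsign_out = rational_activation(x, a=[0.0, 1.0], b=[1.0])
```

Different coefficient values yield different shapes, which is how a separate activation could be learned for each layer's vector generator.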