PEMA: An Offsite-Tunable Plug-in External Memory Adaptation for Language Models

HyunJin Kim, Young Jin Kim, JinYeong Bak

TL;DR

PEMA tackles privacy-constrained fine-tuning of proprietary PLMs by introducing an offsite-tunable PEFT method that builds an external memory of PLM context representations and employs a LoRA-like bottleneck adapter. Training uses a two-phase process to preserve PLM knowledge while learning task-specific next-token predictions, and inference blends PLM and PEMA outputs with Gradual Unrolling to emphasize task-specific generation early and context-rich language modeling later. Empirical results on WMT22 EN→DE and GYAFC demonstrate strong memory and latency efficiency alongside improved translation and formality-transfer quality, outperforming several baselines and showing the value of the Gradual Unrolling and reconstruction components. The work highlights a viable path for privacy-preserving, offsite-tunable adaptation of confidential PLMs and provides actionable guidance on hyperparameters and component contributions for future research.

Abstract

Pre-trained language models (PLMs) show impressive performance in various downstream NLP tasks. However, pre-training large language models demands substantial memory and training compute. Furthermore, because of the substantial resources required, many PLM weights are confidential. Consequently, users are compelled to share their data with model owners in order to fine-tune for specific tasks. To overcome these limitations, we introduce Plug-in External Memory Adaptation (PEMA), a Parameter-Efficient Fine-Tuning (PEFT) method that enables PLM fine-tuning without requiring access to all the weights. PEMA integrates with context representations from test data during inference to perform downstream tasks. It uses external memory to store PLM-generated context representations mapped to target tokens. Our method uses the weight matrices of a LoRA-like bottlenecked adapter in the PLM's final layer to enhance efficiency. Our approach also includes Gradual Unrolling, a novel interpolation strategy that improves generation quality. We validate PEMA's effectiveness through experiments on syntactic and real datasets for machine translation and style transfer. Our findings show that PEMA outperforms other PEFT approaches in memory and latency efficiency for training, and also excels at maintaining sentence meaning and generating appropriate language and styles.
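
To make the external-memory idea concrete, the following is a minimal sketch, not the authors' implementation. It pairs each context representation with its target token, as the abstract describes. The names `build_external_memory`, `prompt_len`, and `toy_context_repr` are hypothetical; in particular, `toy_context_repr` is a stand-in for whatever representation the PLM owner would return, and storing only positions after the prompt is an assumption.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def build_external_memory(
    token_ids: Sequence[int],
    prompt_len: int,
    context_repr: Callable[[Sequence[int]], Vector],
) -> List[Tuple[Vector, int]]:
    """Pair each context representation f(c_i) with its target token y_i.

    Assumption: only positions after the prompt are stored as memory entries.
    """
    memory: List[Tuple[Vector, int]] = []
    for i in range(prompt_len, len(token_ids)):
        context = token_ids[:i]   # c_i: every token before position i
        target = token_ids[i]     # y_i: the token that follows c_i
        memory.append((context_repr(context), target))
    return memory


def toy_context_repr(context: Sequence[int]) -> Vector:
    # Hypothetical stand-in for the PLM owner's service. A real PLM would
    # return a hidden-state vector; this returns two simple statistics.
    return [float(len(context)), float(sum(context))]


memory = build_external_memory(
    [5, 7, 9, 11], prompt_len=2, context_repr=toy_context_repr
)
for representation, target in memory:
    print(representation, "->", target)
```

In this sketch the data owner only ever sees representations and its own target tokens, which mirrors the separation between PLM owner and data owner described in the abstract.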

Paper Structure (33 sections, 6 equations, 6 figures, 19 tables)

Figures (6)

  • Figure 1: A motivation for PEMA. (a) Data owners who want to fine-tune PLMs encounter a problem when the PLM owner refuses to share all the weights of the PLM. (b) In the PEMA training phase, the data owner obtains a context representation (CR) from the PLM owner by providing a context prompt, then trains their PEMA model on their own dataset. (c) At inference, the data owner obtains a CR for the test data from the PLM owner. Using Gradual Unrolling (GU), they generate the next token by interpolating between the PEMA and PLM next-token probabilities.
  • Figure 2: An illustration of PEMA. The areas of the PLM owner and the data owner are separated by the blue horizontal line. The data owner can train and infer using only the PLM's LM head. PEMA builds an external memory from the training context with an instruction $[Inst]$ given to a PLM. The PLM outputs the representation $f(c_i)$ and predicts the next-token distribution $P_{LM}(\hat{w}_i)$. The representation $f(c_i)$ is then aligned with its target $y_i$. In the training phase, PEMA uses external memory for two tasks: preserving the original representation via reconstruction training with $B_{rct}$ and generating a target token probability distribution using $B_{pd}$. For inference, the model inputs a test data representation to generate two probability distributions: $P_{LM}(\hat{w}_i)$ and $P_{PEMA}(\hat{w}_i)$. These are then interpolated using Gradual Unrolling to obtain the final token distribution.
  • Figure 3: The intuition of Gradual Unrolling. Given the input sentence (Black), the interpolation percentage of the adaptation model (Blue) decreases gradually while that of the language model (Red) increases as the sentence is being generated. This strategy ensures that the adaptation model generates tokens trained for the desired task at the beginning of the sentence, and the language model provides the necessary context in the remaining part of the sentence.
  • Figure 4: Performance variations on the WMT22 task with interpolation values $\lambda_{max}$ (left) and $\kappa$ (right). For $\lambda_{max}$, using Gradual Unrolling ($GU$) prevents performance degradation and enhances results, whereas without $GU$ performance drops sharply. For $\kappa$, with $\lambda_{max}$ fixed at 0.7, combining reconstruction loss with next-token prediction loss improves performance over excluding reconstruction loss (red dotted line), as indicated by better results when $\kappa$ is above zero.
  • Figure 5: Performance variation for each interpolation value $\lambda_{max}$ in the WMT22 task. With both Gradual Unrolling ($GU$) (blue) and without $GU$ (red), there is a decline in performance at a specific point of $\lambda_{max}$. However, when utilizing $GU$, the model is not only robust to performance degradation but also gains performance improvement.
  • ...and 1 more figure
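
The interpolation described in the captions for Figures 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact formulation: the captions say only that the weight on the adaptation model decreases gradually, so the linear decay from $\lambda_{max}$ used here is an assumption, and the function names `gradual_unrolling_weights` and `interpolate` are hypothetical.

```python
from typing import List


def gradual_unrolling_weights(lambda_max: float, num_tokens: int) -> List[float]:
    """Weight on the PEMA distribution at each generation step.

    Assumption: linear decay, starting at lambda_max for the first token
    and shrinking toward zero as the sentence is generated.
    """
    if num_tokens <= 0:
        return []
    return [lambda_max * (1 - i / num_tokens) for i in range(num_tokens)]


def interpolate(p_pema: List[float], p_lm: List[float], lam: float) -> List[float]:
    """Blend two next-token distributions: lam * P_PEMA + (1 - lam) * P_LM."""
    return [lam * a + (1 - lam) * b for a, b in zip(p_pema, p_lm)]


# Toy distributions over a three-token vocabulary (illustrative values only).
p_pema = [0.7, 0.2, 0.1]
p_lm = [0.1, 0.3, 0.6]

weights = gradual_unrolling_weights(lambda_max=0.8, num_tokens=4)
for step, lam in enumerate(weights):
    blended = interpolate(p_pema, p_lm, lam)
    print(step, round(lam, 3), [round(p, 3) for p in blended])
```

Because each blended vector is a convex combination of two probability distributions, it still sums to one. Early steps lean on the PEMA distribution and later steps lean on the PLM, which matches the intuition given for Figure 3.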