NVCiM-PT: An NVCiM-assisted Prompt Tuning Framework for Edge LLMs

Ruiyang Qin, Pengyu Ren, Zheyu Yan, Liu Liu, Dancheng Liu, Amir Nassereldine, Jinjun Xiong, Kai Ni, Sharon Hu, Yiyu Shi

TL;DR

A novel NVCiM-assisted PT framework is introduced in which the core operations are narrowed down to matrix-matrix multiplication, which can then be accelerated by performing in-situ computation on NVCiM.

Abstract

Large Language Models (LLMs) deployed on edge devices, known as edge LLMs, need to continuously fine-tune their model parameters on user-generated data under tight resource constraints. However, most existing learning methods are not applicable to edge LLMs because they demand substantial resources and offer low learning capacity. Prompt tuning (PT) has recently emerged as an effective fine-tuning method for edge LLMs because it modifies only a small portion of LLM parameters, but it suffers from user domain shifts, which lead to repeated training and a loss of resource efficiency. Conventional techniques for addressing domain shift often involve complex neural networks and sophisticated training, which are incompatible with PT for edge LLMs. Therefore, an open research question is how to address domain shift for edge LLMs with limited resources. In this paper, we propose a prompt tuning framework for edge LLMs that exploits the benefits offered by non-volatile computing-in-memory (NVCiM) architectures. We introduce a novel NVCiM-assisted PT framework in which we narrow down the core operations to matrix-matrix multiplication, which can then be accelerated by performing in-situ computation on NVCiM. To the best of our knowledge, this is the first work employing NVCiM to improve edge LLM PT performance.

Paper Structure

This paper contains 21 sections, 5 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Edge LLM performance comparison for two LLMs (Gemma-2B and Phi-2) across four datasets on four prompt tuning methods: Vanilla, DEPT, P-tuning v2 (P-t* v2), and prefix tuning with OVT (optimal sets of virtual tokens).
  • Figure 2: Resource overhead from storing virtual tokens and moving data.
  • Figure 3: Overview of our proposed NVCiM-based Prompt Tuning framework (NVCiM-PT). By co-design, it can utilize different types of NVCiM to improve prompt-tuning-based LLM content generation.
  • Figure 4: Implementation of our scaled search algorithm on NVCiM
  • Figure 5: Evaluation of our SSA on CPU, RRAM, and FeFET