PromptIntern: Saving Inference Costs by Internalizing Recurrent Prompt during Large Language Model Fine-tuning

Jiaru Zou, Mengyu Zhou, Tao Li, Shi Han, Dongmei Zhang

TL;DR

PromptIntern is a novel fine-tuning approach for large language models that internalizes prompt knowledge into the model parameters during fine-tuning, enabling efficient inference at lower cost. Its fine-tuning pipeline combines instruction template compression, few-shot example absorption, and a progressive internalization strategy.

Abstract

Recent advances in fine-tuning large language models (LLMs) have greatly enhanced their usage in domain-specific tasks. Despite the success, fine-tuning continues to rely on repeated and lengthy prompts, which escalate computational expenses, require more resources, and lead to slower inference. In this paper, we present a novel approach, PromptIntern, which internalizes prompt knowledge during model fine-tuning to achieve efficient inference and save costs. Instead of compressing the prompts for a vanilla model, PromptIntern aims to embed the recurrent prompt directly into the model parameters. We design a fine-tuning pipeline that includes instruction template compression, few-shot example absorption, and a progressive internalization strategy, effectively diminishing the need for intricate prompts during inference. Comprehensive experiments on challenging NL2Code tasks demonstrate that our method reduces input tokens by more than 90%, accelerates inference by 4.2 times, and reduces monetary inference costs by 88.3%.

Paper Structure (47 sections, 4 equations, 7 figures, 9 tables, 1 algorithm)


Figures (7)

  • Figure 1: An illustration of PromptIntern: Like human interns, LLMs learn and internalize repeated prompt information such as templates and examples during fine-tuning, leading to efficient and effective inference.
  • Figure 2: Overview of the PromptIntern framework. We structure the input prompt into three components: the template, examples, and query. By employing template compression and example absorption, we efficiently preprocess each component based on the schedules $\mathcal{S}^{tmp}$ and $\mathcal{S}^{egs}$. We then use a progressive fine-tuning strategy to gradually incorporate prompt knowledge into the model parameters $\theta$, facilitating efficient inference without sacrificing performance.
  • Figure 3: An example from NL2F demonstrating how an original prompt is preprocessed through template compression and example absorption in PromptIntern for progressive fine-tuning and final inference.
  • Figure 4: The instruction for the baseline method GPT-4 Generation.
  • Figure 5: Prompts of MBPP
  • ...and 2 more figures
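
The Figure 2 caption describes a prompt split into template, examples, and query, with the first two shrunk according to schedules before each fine-tuning stage. The following is a minimal sketch of that idea, not the authors' implementation. Everything named here is an assumption made for illustration: the `Prompt` class, the linear decay in `linear_schedule`, and the word-truncation stand-in used in `compress_template`. The source does not specify the actual schedules or the compressor.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Prompt:
    template: str        # instruction template shared across queries
    examples: List[str]  # few-shot demonstrations
    query: str           # the only part that changes per request


def linear_schedule(stage: int, num_stages: int) -> float:
    """Fraction of a component kept at a stage (ASSUMED linear decay).

    Stage 0 keeps everything (1.0); the final stage keeps nothing (0.0).
    """
    if num_stages < 2:
        raise ValueError("need at least two stages")
    return 1.0 - stage / (num_stages - 1)


def compress_template(template: str, keep_ratio: float) -> str:
    """Placeholder compressor: keep the leading fraction of the words.

    This truncation is only a stand-in; a real pipeline would use a
    proper prompt-compression method.
    """
    words = template.split()
    return " ".join(words[: round(len(words) * keep_ratio)])


def absorb_examples(examples: List[str], keep_ratio: float) -> List[str]:
    """Keep only the leading fraction of the few-shot examples."""
    return examples[: round(len(examples) * keep_ratio)]


def build_stage_input(prompt: Prompt, stage: int, num_stages: int) -> str:
    """Assemble the training input for one fine-tuning stage."""
    ratio = linear_schedule(stage, num_stages)
    parts = [compress_template(prompt.template, ratio)]
    parts += absorb_examples(prompt.examples, ratio)
    parts.append(prompt.query)  # the query is never shortened
    return "\n".join(p for p in parts if p)


if __name__ == "__main__":
    p = Prompt("Write an Excel formula", ["ex1", "ex2"], "sum column A")
    for stage in range(3):
        print(f"--- stage {stage} ---")
        print(build_stage_input(p, stage, 3))
```

Under these assumptions, the first stage trains on the full prompt and the last stage trains on the query alone, which matches the query-only input the model would receive at inference time.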