PromptIntern: Saving Inference Costs by Internalizing Recurrent Prompt during Large Language Model Fine-tuning

Jiaru Zou; Mengyu Zhou; Tao Li; Shi Han; Dongmei Zhang

TL;DR

PromptIntern is a novel approach to fine-tuning large language models that internalizes prompt knowledge into the model parameters during fine-tuning, making inference more efficient and less costly. Its fine-tuning pipeline combines instruction template compression, few-shot example absorption, and a progressive internalization strategy.

Abstract

Recent advances in fine-tuning large language models (LLMs) have greatly enhanced their usage in domain-specific tasks. Despite the success, fine-tuning continues to rely on repeated and lengthy prompts, which escalate computational expenses, require more resources, and lead to slower inference. In this paper, we present a novel approach, PromptIntern, which internalizes prompt knowledge during model fine-tuning to achieve efficient inference and save costs. Instead of compressing the prompts for a vanilla model, PromptIntern aims to embed the recurrent prompt directly into the model parameters. We design a fine-tuning pipeline that includes instruction template compression, few-shot example absorption, and a progressive internalization strategy, effectively diminishing the need for intricate prompts during inference. Comprehensive experiments on challenging NL2Code tasks demonstrate that our method reduces input tokens by more than 90%, accelerates inference by 4.2 times, and reduces monetary inference costs by 88.3%.
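To make the idea of progressive internalization concrete, the following is a minimal sketch of how a prompt could be shortened stage by stage during fine-tuning. It is not the authors' implementation: the linear schedule, the `Stage` and `build_prompt` names, and the use of word truncation as a stand-in for template compression are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Stage:
    template_keep_ratio: float  # fraction of template words kept at this stage
    num_examples: int           # number of few-shot examples kept at this stage


def linear_schedule(num_stages: int, max_examples: int) -> list[Stage]:
    """Hypothetical schedule: full prompt at the first stage, query only at the last."""
    if num_stages < 2:
        raise ValueError("need at least two stages")
    stages = []
    for i in range(num_stages):
        frac = 1.0 - i / (num_stages - 1)  # decays linearly from 1.0 to 0.0
        stages.append(Stage(frac, round(max_examples * frac)))
    return stages


def build_prompt(template: str, examples: list[str], query: str, stage: Stage) -> str:
    """Assemble the training input for one stage.

    Truncating the template by word count is only a placeholder for a real
    template compressor (assumption, not the paper's method).
    """
    words = template.split()
    kept = words[: round(len(words) * stage.template_keep_ratio)]
    parts = [" ".join(kept)] + examples[: stage.num_examples] + [query]
    return "\n".join(p for p in parts if p)


if __name__ == "__main__":
    template = "Translate the request into a spreadsheet formula"
    examples = ["ex1", "ex2", "ex3", "ex4"]
    for stage in linear_schedule(num_stages=3, max_examples=len(examples)):
        print(repr(build_prompt(template, examples, "sum column A", stage)))
```

In this sketch the model would be fine-tuned on the output of each stage in turn, so that by the final stage it sees only the query, which is also the input it would receive at inference time.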

Paper Structure

This paper contains 47 sections, 4 equations, 7 figures, 9 tables, and 1 algorithm.

Introduction
Related Work
Problem Formulation
Methodology
Template Compression
Example Absorption
PromptIntern Pipeline
Experiment
Settings
Datasets
Evaluation Metrics
Baselines
Models
Implementation Details
Prompt Compression Comparison
...and 32 more sections

Figures (7)

Figure 1: An illustration of PromptIntern: Like human interns, LLMs learn and internalize repeated prompt information such as templates and examples during fine-tuning, leading to efficient and effective inference.
Figure 2: Overview of PromptIntern framework. We structure the input prompt into three components: the template, examples, and query. By employing template compression and example absorption, we efficiently preprocess each component based on schedule $\mathcal{S}^{tmp},\mathcal{S}^{egs}$. We then use a progressive fine-tuning strategy to gradually incorporate prompt knowledge into the model parameters $\theta$, facilitating efficient inference without sacrificing performance.
Figure 3: An Example from NL2F demonstrating how an original prompt is preprocessed through template compression and example absorption in PromptIntern for progressive fine-tuning and final inference.
Figure 4: The instruction baseline for the baseline method GPT-4 Generation.
Figure 5: Prompts of MBPP
...and 2 more figures
