Generative Prompt Internalization

Haebin Shin, Lei Ji, Yeyun Gong, Sungdong Kim, Eunbi Choi, Minjoon Seo

TL;DR

GenPI addresses the inefficiency of fixed, lengthy prompts in large language model applications by learning to internalize prompts through a joint training regime that also generates the prompt content and the rationale for behavior changes. It combines a standard distillation objective with a Prompt Generation loss and uses lightweight prompt-specific adapters trained on a small synthetic dataset produced via Self Role-Playing, enabling effective prompt internalization without additional prompt inputs at inference. Across three agent-oriented benchmarks, GenPI achieves near-upper-bound performance on short prompts and maintains strong performance (>82%) on longer prompts of over 1,000 tokens, while delivering substantial efficiency gains (reductions of up to 39% in MACs/FLOPs and 17% in latency) compared with prompt-based baselines. The approach offers a practical pathway to reducing inference costs in real-world LLM deployments by internalizing per-prompt behavior, with potential extensions to retrieval-augmented and multimodal settings.
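
The joint objective described above can be sketched as a sum of two token-level losses. This is a minimal sketch assuming both terms are mean cross-entropy over target tokens and that they are mixed with a hypothetical weight `pg_weight`; the names `token_nll` and `joint_loss` are illustrative, and the paper's exact formulation and weighting are not reproduced here. The SFT term scores the student (which sees no prompt) against the teacher's prompted response; the Prompt Generation term scores the student on the prompt content followed by the reason for the behavior change.

```python
import math
from typing import Sequence


def token_nll(logits: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Mean negative log-likelihood of target token ids given per-position logits."""
    if len(logits) != len(targets) or not targets:
        raise ValueError("need one non-empty logit row per target token")
    total = 0.0
    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp: log sum exp(x) = m + log sum exp(x - m).
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]
    return total / len(targets)


def joint_loss(sft_logits, sft_targets, pg_logits, pg_targets, pg_weight: float = 1.0) -> float:
    """Hypothetical joint objective: SFT (distillation) loss + weighted Prompt Generation loss.

    sft_targets: teacher response tokens produced WITH the prompt; the student
                 produced sft_logits from the user input alone (no prompt).
    pg_targets:  tokens of the prompt content followed by the reason for the
                 behavior change (student "AS-IS" vs. teacher "TO-BE" output).
    pg_weight:   assumed mixing weight, not a value taken from the paper.
    """
    return token_nll(sft_logits, sft_targets) + pg_weight * token_nll(pg_logits, pg_targets)
```

With only the SFT term, the student imitates the prompted behavior; adding the Prompt Generation term additionally asks the same adapter to reproduce what the prompt said and why it matters.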

Abstract

Prompts used in recent large language model-based applications are often fixed and lengthy, leading to significant computational overhead. To address this challenge, we propose Generative Prompt Internalization (GenPI), a lightweight method that employs a joint training approach. GenPI not only replicates the behavior of models with prompt inputs but also generates the content of the prompt along with reasons for why the model's behavior should change accordingly. We demonstrate that our approach effectively internalizes complex prompts across various agent-based application scenarios. For effective training without interactions with dedicated environments, we introduce a data synthesis technique that autonomously collects conversational datasets by swapping the roles of the agent and environment. This method is especially useful in scenarios where only a predefined prompt is available without a corresponding training dataset. By internalizing complex prompts, Generative Prompt Internalization enables high performance and efficient inference without the need for explicit prompts.
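
The role-swapping data synthesis can be sketched as a loop in which one model plays both sides of the conversation. This is a minimal sketch under stated assumptions: `generate(system_prompt, history)` is a hypothetical callable wrapping any chat model, `env_prompt` is a hypothetical prompt describing the environment, and the environment turn is produced by showing the model the conversation with the "user" and "agent" labels flipped. It illustrates the idea of Self Role-Playing, not the paper's exact procedure.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (role, text); role is "user" or "agent"
Generate = Callable[[str, List[Turn]], str]  # hypothetical: (system prompt, history) -> reply


def swap_roles(history: List[Turn]) -> List[Turn]:
    """Relabel every turn so agent messages appear as user messages and vice versa."""
    flip = {"user": "agent", "agent": "user"}
    return [(flip[role], text) for role, text in history]


def self_role_play(generate: Generate, agent_prompt: str, env_prompt: str,
                   first_user_msg: str, num_turns: int) -> List[Turn]:
    """Collect a pseudo conversation in which one model plays both the agent and the environment."""
    history: List[Turn] = [("user", first_user_msg)]
    for _ in range(num_turns):
        # Agent side: the model answers under the fixed (to-be-internalized) agent prompt.
        history.append(("agent", generate(agent_prompt, history)))
        # Environment side: the same model sees the conversation with roles swapped,
        # so it replies "as the agent" while actually producing the environment's turn.
        history.append(("user", generate(env_prompt, swap_roles(history))))
    return history
```

The collected turns can then serve as training inputs when only the predefined prompt exists and no interaction data is available.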

Paper Structure

This paper contains 57 sections, 6 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Overview of Generative Prompt Internalization. SFT loss learns the teacher model’s behavior based on the user input. Prompt Generation loss internalizes the prompt by generating both the content of the prompt and the reason why the model’s behavior should be modified. This process is guided by comparing the student model’s output ("AS-IS") with the teacher model’s output ("TO-BE"). SFT loss and Prompt Generation loss are combined into a joint loss to train the prompt-specific adaptor.
  • Figure 2: Self Role-Playing conversation. Pseudo conversational outputs are collected by switching roles in the prompt.
  • Figure 3: Comparison of computational overhead in LLaMA-based baselines as the conversation turn progresses. All generations within a turn are reported with KV caching pope2022kvcache applied. Best viewed in color.
  • Figure 4: Comparison of computational overhead in LLaMA-based baselines applying KV caching pope2022kvcache across the multi-turn conversation. Even when previous content is cached, a long context still incurs extra overhead. Best viewed in color.
  • Figure 5: Agent Prompt for OS Interaction. Following the task setup from AgentBench liu2023agentbenchevaluatingllmsagents, we describe all content, including the system prompt and demonstrations, as a multi-turn strategy using <USER> and <AGENT>.
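
Figures 3 and 4 note that a long context still adds overhead even with KV caching. A rough arithmetic sketch shows why, assuming a common back-of-the-envelope cost model (not necessarily the paper's measurement method): dense layers cost about one multiply-accumulate (MAC) per non-embedding parameter per token, and each new token's attention over every cached position costs about 2·n_layers·d_model MACs per position. The helper names and the model shape below are hypothetical placeholders.

```python
def decode_macs_per_token(n_params: int, n_layers: int, d_model: int, context_len: int) -> int:
    """Rough MACs to decode one token when keys/values of earlier positions are cached.

    - Dense layers: about one MAC per (non-embedding) parameter.
    - Attention: q.K over context_len cached positions plus softmax(.)V,
      each costing d_model MACs per position per layer.
    Softmax, normalization, and the one-off prefill pass are ignored.
    """
    return n_params + 2 * n_layers * d_model * context_len


def turn_decode_macs(n_params: int, n_layers: int, d_model: int,
                     prefix_len: int, new_tokens: int) -> int:
    """MACs to generate new_tokens after a cached prefix of prefix_len tokens."""
    # The i-th new token attends to the whole prefix plus the i tokens generated so far (itself included).
    return sum(decode_macs_per_token(n_params, n_layers, d_model, prefix_len + i)
               for i in range(1, new_tokens + 1))


# Illustrative shape only (32 layers, d_model 4096); the parameter count is a round placeholder.
shape = dict(n_params=6_000_000_000, n_layers=32, d_model=4096)
history_len, reply_len, prompt_len = 500, 100, 1000
with_prompt = turn_decode_macs(**shape, prefix_len=prompt_len + history_len, new_tokens=reply_len)
without_prompt = turn_decode_macs(**shape, prefix_len=history_len, new_tokens=reply_len)
print(f"extra decode MACs from a cached {prompt_len}-token prompt: {with_prompt - without_prompt:.3e}")
```

Under this model a cached prompt of P tokens adds 2·n_layers·d_model·P MACs to every generated token, on every turn. Internalizing the prompt removes that term (and the prompt's prefill); the measured savings reported in the paper come from its own experiments, not from this approximation.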