Behavior-Equivalent Token: Single-Token Replacement for Long Prompts in LLMs

Jiancheng Dong, Pengyue Jia, Jingyu Peng, Maolin Wang, Yuhao Wang, Lixin Su, Xin Sun, Shuaiqiang Wang, Dawei Yin, Xiangyu Zhao

TL;DR

The paper tackles the inefficiencies of long system prompts by introducing the Behavior-Equivalent Token ([BE]), a single learned token that preserves a long prompt's effect. A self-contained three-stage training pipeline uses a universal [AE] reconstruction trigger, prompt-specific [BE] embedding, and knowledge distillation to align downstream behavior with the full prompt. Empirical results show up to $3000\times$ compression with about 98% retention of performance across RoleLLM, GSM8K, and HPD tasks, along with notable inference efficiency gains. This approach eliminates the need for external encoders or labeled data, enabling scalable and practical prompt compression for diverse LLM deployments.

Abstract

Carefully engineered system prompts play a critical role in guiding the behavior of LLM agents, but their considerable length introduces significant drawbacks, including increased inference latency, higher computational cost, and reduced effective context length. This raises the question of whether such lengthy prompts can be replaced by a drastically reduced number of tokens while preserving their behavioral effect on downstream tasks. To enable this, we propose a lightweight three-stage training framework that learns a single prompt-specific Behavior-Equivalent token ([BE]). The framework first trains [BE] to encode the natural-language content of the original system prompt via reconstruction, and then distills the prompt's downstream behavior into this single token. Importantly, our method requires no access to model internals, no auxiliary compression models, and no labeled responses. Empirical evaluations on three datasets show that a single [BE] token achieves up to a 3000x reduction in prompt length, while retaining about 98% of the downstream performance of the original system prompts. This substantially reduces inference cost and leaves almost the entire context window available for user inputs.

Paper Structure

This paper contains 34 sections, 6 equations, 5 figures, 8 tables, 1 algorithm.

Figures (5)

  • Figure 1: A single learned token [BE] replaces a long system prompt (up to 3,000 tokens), eliciting nearly identical responses from the LLM.
  • Figure 2: Top: Comparison with prior work. (a) Memory tokens that directly trigger verbatim reconstruction of a long text tend to function as rote triggers and do not transfer information to downstream tasks. (b) Prompt tuning methods learn from labeled examples but struggle to converge and often fail to fulfill the specific requirements designated in the original prompt. Bottom: Our three-stage pipeline. (c) Pre-train a universal reconstruction token [AE] to elicit text reconstruction. (d) Train a single prompt-specific [BE] so that [BE]+[AE] reconstructs the target system prompt. (e-f) Distill the full prompt's downstream behavior into [BE]. Trainable tokens are marked with a flame; the base LLM is frozen throughout.
  • Figure 3: Sensitivity of behavior alignment to loss weights and teacher choice.
  • Figure 4: Outputs under five prefixing strategies on the same persona-role query.
  • Figure 5: Losses of Stage 1 and Stage 2 grouped by loss weight $\lambda$. Rows correspond to $\lambda \in \{0.1, 0.5, 0.9, 1.0\}$ from top to bottom. Columns correspond to supervision type: left = original answer, right = self-generated.
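
The distillation step described in Figure 2 (e-f), where only the [BE] embedding is trained while the base LLM stays frozen, can be illustrated with a toy sketch. Everything below is an assumption made for illustration: a fixed random matrix `W` stands in for the frozen model, `teacher_probs` stands in for the next-token distribution produced with the full system prompt, and a plain KL divergence is used as a generic distillation objective. It is not the paper's actual loss (which, per Figure 5, involves a weight $\lambda$) and the names `frozen_logits` and `distill_single_embedding` are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes every q_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def frozen_logits(W, e):
    """Toy stand-in for a frozen LLM (assumption): logits = W @ e."""
    return [sum(w * x for w, x in zip(row, e)) for row in W]


def distill_single_embedding(W, teacher_probs, dim, steps=2000, lr=0.02, seed=0):
    """Train one embedding vector so the student matches the teacher.

    Only `be` is updated; W is never modified, mirroring a frozen base model.
    Returns the learned embedding and the KL value recorded before each step.
    """
    rng = random.Random(seed)
    be = [rng.gauss(0, 0.01) for _ in range(dim)]
    history = []
    for _ in range(steps):
        student = softmax(frozen_logits(W, be))
        history.append(kl_divergence(teacher_probs, student))
        # Gradient of KL(teacher || student) w.r.t. be is W^T (student - teacher).
        diff = [s - t for s, t in zip(student, teacher_probs)]
        grad = [sum(W[v][d] * diff[v] for v in range(len(W))) for d in range(dim)]
        be = [x - lr * g for x, g in zip(be, grad)]
    return be, history


rng = random.Random(1)
vocab, dim = 8, 4
W = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(vocab)]

# Hypothetical "full prompt" state; its output distribution acts as the teacher.
prompt_state = [rng.gauss(0, 1) for _ in range(dim)]
teacher_probs = softmax(frozen_logits(W, prompt_state))

be, history = distill_single_embedding(W, teacher_probs, dim)
print(f"KL before training: {history[0]:.4f}, after: {history[-1]:.6f}")
```

The point of the sketch is the division of labor: the teacher distribution comes from the model itself under the full prompt, so no labeled responses are needed, and the only trainable parameters are the entries of a single embedding vector.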