LanguaShrink: Reducing Token Overhead with Psycholinguistics

Xuechen Liang; Meiling Tao; Yinghui Xia; Tianyu Shi; Jun Wang; JingSong Yang

TL;DR

LanguaShrink is a task-agnostic prompt compression framework. It combines part-of-speech priority compression with data distillation, using smaller models to learn compression targets and training them with a KL-regularized reinforcement learning strategy.

Abstract

As large language models (LLMs) become more capable of handling complex tasks, the computational cost and inefficiency caused by long prompts are becoming increasingly prominent. To accelerate model inference and reduce costs, we propose a prompt compression framework called LanguaShrink. Inspired by the observation that LLM performance depends on the density and position of key information in the input prompt, LanguaShrink leverages psycholinguistic principles and the Ebbinghaus memory curve to achieve task-agnostic prompt compression, reducing prompt length while preserving essential information. Following the training method of OpenChat \cite{wang2023openchat}, the framework introduces part-of-speech priority compression and data distillation techniques, using smaller models to learn compression targets and employing a KL-regularized reinforcement learning strategy for training. Additionally, we adopt a chunk-based compression algorithm to achieve adjustable compression rates. We evaluate our method on multiple datasets, including LongBench, ZeroScrolls, Arxiv Articles, and a newly constructed novel test set. Experimental results show that LanguaShrink maintains semantic similarity while achieving up to 26 times compression. Compared to existing prompt compression methods, LanguaShrink improves end-to-end latency by 1.43 times.

Paper Structure (30 sections, 4 equations, 2 figures, 6 tables, 2 algorithms)

Introduction
Related work
Psycholinguistics
Prompt Compression
Method
POS Priority Compression
Dataset Distillation
Prompt Compress-RLFT
Reward Design
Tuning
Chunk-Based Compression
Experiment
Settings
Main Results
Ablation Study
...and 15 more sections

Figures (2)

Figure 1: Illustration of the Plug-and-Play Document Module. The document encoding is decoupled from specific tasks. By inserting the document plugin into the task model, we can separate compressed text from downstream task reasoning and reduce computational costs.
Figure 2: (a) Data distillation. Initial text compression is first performed using POS priority compression. Next, each compressed prompt is evaluated on its similarity to the original prompt and on the compression ratio achieved. If the similarity is above the threshold, the model receives a reward; otherwise, the reward is zero and the sample is filtered out. The model is then fine-tuned using Maximum Likelihood Estimation (MLE), and finally the compressor generates the compressed prompts. (b) Inference. The compressor is applied to an actual question-answering task, showing the effect of LanguaShrink compression on the original dialogue. Red indicates the parts that are most likely to be compressed, and blue indicates the parts that are next most likely to be compressed.
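The reward-and-filter step described in Figure 2(a) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the caption does not give the reward formula, so the product of similarity and compression ratio used here is an assumption, as are the word-overlap similarity (a toy stand-in for a semantic similarity model), the whitespace tokenization, the default threshold of 0.5, and all function names.

```python
from typing import Callable, List, Tuple


def jaccard_similarity(a: str, b: str) -> float:
    """Toy stand-in for semantic similarity: word-set overlap in [0, 1]."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def compression_ratio(original: str, compressed: str) -> float:
    """Original length divided by compressed length, in whitespace tokens."""
    return len(original.split()) / max(len(compressed.split()), 1)


def reward(
    original: str,
    compressed: str,
    threshold: float = 0.5,  # hypothetical value; the source does not state it
    similarity: Callable[[str, str], float] = jaccard_similarity,
) -> float:
    """Zero reward at or below the similarity threshold, positive above it."""
    sim = similarity(original, compressed)
    if sim <= threshold:
        return 0.0
    # Assumed combination: reward grows with both similarity and compression.
    return sim * compression_ratio(original, compressed)


def filter_pairs(
    pairs: List[Tuple[str, str]], threshold: float = 0.5
) -> List[Tuple[str, str, float]]:
    """Keep only (original, compressed) pairs that earn a non-zero reward."""
    kept = []
    for original, compressed in pairs:
        r = reward(original, compressed, threshold)
        if r > 0.0:
            kept.append((original, compressed, r))
    return kept


original = "the quick brown fox jumps over the lazy dog"
good = "quick brown fox jumps over lazy dog"  # drops only the articles
bad = "dog"                                   # loses almost everything
kept = filter_pairs([(original, good), (original, bad)])
```

In this toy run only the first pair survives the filter; the surviving pairs would then serve as fine-tuning data for the compressor.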
