Cognitive Chunking for Soft Prompts: Accelerating Compressor Learning via Block-wise Causal Masking

Guojie Liu, Yiqi Wang, Yanfeng Yang, Wenqi Fan, Songlei Jian, Jianfeng Zhang, Jie Yu

TL;DR

Parallelized Iterative Compression (PIC) draws on the chunking mechanism in human working memory and on the empirical observation that memory embeddings specialize spatially relative to the original tokens. It makes compressor training easier by changing only the Transformer's attention mask.

Abstract

Providing extensive context via prompting is vital for leveraging the capabilities of Large Language Models (LLMs). However, lengthy contexts significantly increase inference latency, as the computational cost of self-attention grows quadratically with sequence length. To mitigate this issue, context compression, and soft prompt compression in particular, has emerged as a widely studied solution: a trained compressor converts a long context into a shorter sequence of memory embeddings. Existing methods typically compress the entire context indiscriminately into a set of memory tokens, which requires the compressor to capture global dependencies and demands extensive pre-training data to learn effective patterns. Inspired by the chunking mechanism in human working memory and by empirical observations of the spatial specialization of memory embeddings relative to original tokens, we propose Parallelized Iterative Compression (PIC). By simply modifying the Transformer's attention mask, PIC explicitly restricts the receptive field of memory tokens to sequential local chunks, thereby lowering the difficulty of compressor training. Experiments across multiple downstream tasks demonstrate that PIC consistently outperforms competitive baselines, with the advantage most pronounced at high compression ratios (e.g., relative improvements of 29.8\% in F1 score and 40.7\% in EM score on QA tasks at the $64\times$ compression ratio). Furthermore, PIC significantly expedites training: when training the $16\times$ compressor, it surpasses the peak performance of the competitive baseline while reducing training time by approximately 40\%.
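
The attention-mask restriction described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the sequence layout (all context tokens first, then all memory tokens), the equal-sized chunks, the plain causal mask among context tokens, and the function name `blockwise_causal_mask` are assumptions made for illustration. The only property taken from the text is that each memory token sees its own local chunk plus the memory tokens before it.

```python
import numpy as np


def blockwise_causal_mask(num_context: int, num_memory: int) -> np.ndarray:
    """Return a boolean (L, L) mask with L = num_context + num_memory.

    mask[q, k] is True when query position q may attend to key position k.
    Assumed layout (hypothetical): [context tokens..., memory tokens...].
    """
    if num_memory <= 0 or num_context % num_memory != 0:
        raise ValueError("num_context must be a positive multiple of num_memory")

    chunk = num_context // num_memory  # context tokens per memory token
    total = num_context + num_memory
    mask = np.zeros((total, total), dtype=bool)

    # Assumption: context tokens use an ordinary causal mask among themselves.
    mask[:num_context, :num_context] = np.tril(
        np.ones((num_context, num_context), dtype=bool)
    )

    for j in range(num_memory):
        row = num_context + j
        # Memory token j sees only its own chunk of the context ...
        mask[row, j * chunk:(j + 1) * chunk] = True
        # ... plus earlier memory tokens and itself.
        mask[row, num_context:row + 1] = True

    return mask


if __name__ == "__main__":
    # 8 context tokens compressed into 2 memory tokens (a 4x ratio).
    print(blockwise_causal_mask(8, 2).astype(int))
```

Under these assumptions a direct-compression baseline would set every context column to True in each memory row. The sketch clears all but one chunk per row, so the whole sequence can still be processed in a single forward pass.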

Paper Structure (32 sections, 8 equations, 24 figures, 5 tables)

Figures (24)

  • Figure 1: An illustration of unconstrained global attention in a typical soft prompt compression method: both memory tokens need to attend to all the input words simultaneously.
  • Figure 2: Attention weight heatmap between memory tokens and original context tokens. Red indicates higher attention weight, while blue represents lower attention weight.
  • Figure 3: Heatmap visualization of the cosine similarity between the memory embeddings produced by the fully trained compressor (90k steps) and the embeddings of the original context tokens on the HotpotQA dataset. Red indicates higher similarity, while blue represents lower similarity.
  • Figure 4: Comparison of different compression paradigms. (a) Direct Compression: Each memory token attends globally to the entire input context without structural constraints. (b) Iterative Compression: Inspired by the chunking mechanism in human working memory, we split the input context into chunks, and memory tokens are generated sequentially, with each token attending only to its specific chunk and previous memory. (c) Parallelized Iterative Compression (PIC): Our proposed method allows all memory tokens to be processed in parallel within a single sequence, using a block-wise causal mask to enforce the same local dependency constraints as the iterative approach.
  • Figure 5: Comparison of Reconstruction Loss During Pre-training. The upper plot illustrates the overall loss trajectory, while the lower plot presents a zoomed-in view of steps 10k to 30k, highlighting the faster convergence of our PIC (red) compared to the PCC baseline (dashed gray).
  • ...and 19 more figures
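
A similarity matrix of the kind visualized in Figure 3 could be computed as in the sketch below. The function name `cosine_similarity_matrix`, the array shapes, and the toy vectors are assumptions for illustration only. The paper's actual embeddings and plotting code are not shown in this excerpt.

```python
import numpy as np


def cosine_similarity_matrix(
    memory: np.ndarray, context: np.ndarray, eps: float = 1e-8
) -> np.ndarray:
    """Cosine similarity between every memory and context embedding.

    memory:  (num_memory, dim) array of memory embeddings.
    context: (num_context, dim) array of original token embeddings.
    Returns a (num_memory, num_context) array; entry [i, j] is the cosine
    similarity between memory embedding i and context embedding j.
    """
    # Normalize each row to unit length; eps guards against zero vectors.
    mem = memory / np.maximum(np.linalg.norm(memory, axis=1, keepdims=True), eps)
    ctx = context / np.maximum(np.linalg.norm(context, axis=1, keepdims=True), eps)
    return mem @ ctx.T


if __name__ == "__main__":
    # Toy 2-dimensional embeddings, purely illustrative.
    memory = np.array([[1.0, 0.0], [0.0, 2.0]])
    context = np.array([[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    print(cosine_similarity_matrix(memory, context))
```

Plotting each row of this matrix against token position gives one heatmap row per memory token. A band of high values confined to a contiguous span of positions would correspond to the spatial specialization the abstract refers to.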