Text Compression for Efficient Language Generation

David Gu, Peter Belcak, Roger Wattenhofer

TL;DR

GPTHF addresses the compute bottleneck of autoregressive language models by replacing sub-word token embeddings with sentence-level embeddings and a hierarchical transformer that uses block-local attention. It preserves a GPT-like architecture with a two-tier setup (a word-level encoder and a sentence-level body) and adds a fast generation method that caches sentence embeddings to reduce computation, achieving up to 10x better FLOPs efficiency and 3x faster runtime than equally sized GPT models in the low-parameter regime. While perplexity increases modestly compared with full baselines, GPTHF follows the scaling laws, and its efficiency gains grow with context length. The results suggest that sentence-level compression is a viable direction for efficient generation at low compute; future work is needed to scale the approach up and to integrate it with caching optimizations.

Abstract

We challenge the prevailing assumption that LLMs must rely fully on sub-word tokens for high-quality text generation. To this end, we propose the "Generative Pretrained Thoughtformer" (GPTHF), a hierarchical transformer language model capable of text generation by compressing text into sentence embeddings and employing a sentence attention mechanism. GPTHF retains GPT's architecture, modifying only token interactions via dynamic sparse attention masks. Our experiments show that GPTHF achieves up to an order-of-magnitude improvement in FLOPs efficiency and a threefold increase in runtime speed compared to equally sized GPT models in the low-size regime. This is achieved through a unique generation method that caches and reuses sentence embeddings, allowing significant portions of the input to bypass large parts of the network.
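
The caching idea in the last sentence of the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `SentenceEmbeddingCache` and `toy_encode` are hypothetical names, the encoder is a trivial stand-in for the word-level encoder, and the sketch assumes that every sentence except the last one in the context is finished and can therefore be cached.

```python
from typing import Callable, Dict, List, Tuple

Embedding = Tuple[float, ...]


def toy_encode(tokens: List[str]) -> Embedding:
    """Hypothetical stand-in for the word-level encoder (not a real model)."""
    return (float(len(tokens)), float(sum(len(t) for t in tokens)))


class SentenceEmbeddingCache:
    """Re-encode only the unfinished (last) sentence; reuse finished ones."""

    def __init__(self, encode: Callable[[List[str]], Embedding]) -> None:
        self.encode = encode
        self.cache: Dict[int, Embedding] = {}
        self.encoder_calls = 0  # counts how often the encoder actually runs

    def embeddings(self, sentences: List[List[str]]) -> List[Embedding]:
        last = len(sentences) - 1
        result: List[Embedding] = []
        for idx, tokens in enumerate(sentences):
            finished = idx < last  # assumption: only the last sentence can still grow
            if finished and idx in self.cache:
                result.append(self.cache[idx])
                continue
            self.encoder_calls += 1
            emb = self.encode(tokens)
            if finished:
                self.cache[idx] = emb
            result.append(emb)
        return result


# Three generation steps over a growing context.
steps = [
    [["The", "cat", "sat", "."], ["It"]],
    [["The", "cat", "sat", "."], ["It", "slept"]],
    [["The", "cat", "sat", "."], ["It", "slept", "."], ["Then"]],
]
cached = SentenceEmbeddingCache(toy_encode)
cached_outputs = [cached.embeddings(s) for s in steps]
uncached_outputs = [[toy_encode(t) for t in s] for s in steps]
uncached_calls = sum(len(s) for s in steps)
```

In this toy run the cached variant invokes the encoder 5 times instead of 7 while producing identical embeddings; the saving grows as more finished sentences accumulate in the context.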

Paper Structure

This paper contains 25 sections, 2 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Visualization of block attention masks for a text with sentence index vector $[0,0,1,1,1]$. (a) A block matrix allowing attention within sentences. (b) Block lower triangular matrix allowing attention to previous tokens within sentences during training.
  • Figure 2: Overview of the Generative THF (GPTHF) Architecture during inference. The boxes in the models indicate the type of attention masks used. The attention masks are explained in Figure 1.
  • Figure 3: Overview of the pre-training procedure. The boxes in the models indicate the type of attention masks used. The attention masks are explained in Figure 4.
  • Figure 4: Attention masks during pre-training for an input with the sentence index vector $[0,0,1,1,1]$. The left matrix is the "block triangular mask" (see Figure 1(b)). After going through the encoder, every token represents the compressed prefix of its sequence up to itself, and is only allowed to attend to itself and compressions of previous sequences (right).
  • Figure 5: Illustration of the Fast Generation Algorithm. Having finished $s_1$ and $s_2$ in the context, any subsequent token mathematically cannot influence $e_1, e_2$. The Fast Generation Algorithm caches them and feeds them directly to the slt_body, together with $e_3$.
  • ...and 3 more figures
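
The two masks described in the caption of Figure 1 can be reproduced with a short sketch. The function names `block_mask` and `block_triangular_mask` are hypothetical, and the rules are read directly from the caption: (a) allows attention between tokens of the same sentence, (b) additionally restricts attention to the current and previous tokens. A 1 means attention is allowed and a 0 means it is masked.

```python
from typing import List

Mask = List[List[int]]


def block_mask(sentence_ids: List[int]) -> Mask:
    """Figure 1(a): token i may attend to token j iff both share a sentence index."""
    n = len(sentence_ids)
    return [
        [int(sentence_ids[i] == sentence_ids[j]) for j in range(n)]
        for i in range(n)
    ]


def block_triangular_mask(sentence_ids: List[int]) -> Mask:
    """Figure 1(b): same sentence AND j <= i (no attention to later tokens)."""
    n = len(sentence_ids)
    return [
        [int(sentence_ids[i] == sentence_ids[j] and j <= i) for j in range(n)]
        for i in range(n)
    ]


# Sentence index vector from the Figure 1 caption.
ids = [0, 0, 1, 1, 1]
mask_a = block_mask(ids)
mask_b = block_triangular_mask(ids)

for row in mask_b:
    print(row)
```

For the vector $[0,0,1,1,1]$ this yields a 2x2 block followed by a 3x3 block on the diagonal; in the triangular variant each block is itself lower triangular.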