
Seq2Seq2Seq: Lossless Data Compression via Discrete Latent Transformers and Reinforcement Learning

Mahdi Khodabandeh, Ghazal Shabani, Arash Yousefi Jordehi, Seyed Abolghasem Mirroshandel

TL;DR

This work addresses lossless data compression by leveraging a discrete, token-based latent representation learned through reinforcement learning on a T5-based seq2seq architecture. The compressor–decompressor pair is trained with an off-policy RL objective, where the reward $r = -(|\bar{c}| + \mathcal{L}_D)$ balances compactness with faithful reconstruction, and entropy considerations are framed via $H(X) = -\sum_i p(x_i) \log p(x_i)$. The approach preserves token structure rather than dense latent vectors, enabling practical deployment on consumer hardware while adapting compression strategies to data without external world knowledge. On enwik8, the method achieves a compression ratio of 4.12, outperforming traditional codecs like XZ and GZIP but still lagging the neural state-of-the-art nncp, illustrating a favorable trade-off between efficiency and compute. The work highlights modular deployment, scalability, and avenues for future improvements such as adaptive chunking, memory-efficient attention, and integrating live compressor–decompressor feedback to further enhance performance.
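
As a rough illustration of the quantities named in the TL;DR, the sketch below computes an empirical entropy $H(X)$, a reward of the form $r = -(|\bar{c}| + \mathcal{L}_D)$, and a compression ratio. Several points are assumptions rather than details taken from the paper: $|\bar{c}|$ is read as the length of the compressed token sequence, $\mathcal{L}_D$ as the decompressor's reconstruction loss, the logarithm is taken base 2, the entropy is a zeroth-order estimate over byte frequencies, and the ratio is original size divided by compressed size. The function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Zeroth-order empirical entropy H(X) = -sum_i p(x_i) log2 p(x_i), in bits per symbol.

    Assumption: base-2 logarithm and byte-level symbol frequencies.
    """
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def reward(compressed_len: int, reconstruction_loss: float) -> float:
    """Reward r = -(|c| + L_D).

    Assumption: compressed_len is the token count of the compressed sequence
    and reconstruction_loss is the decompressor's loss; both are penalized.
    """
    return -(compressed_len + reconstruction_loss)


def compression_ratio(original_size: int, compressed_size: int) -> float:
    """Assumed definition: original size divided by compressed size."""
    return original_size / compressed_size


if __name__ == "__main__":
    print(shannon_entropy(b"aabb"))       # two equally likely symbols
    print(reward(10, 0.5), reward(8, 0.5))
    print(compression_ratio(100, 25))
```

Under this reading, a shorter compressed sequence with the same reconstruction loss always earns a higher reward, which is the pressure toward compactness that the TL;DR describes.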

Abstract

Efficient lossless compression is essential for minimizing storage costs and transmission overhead while preserving data integrity. Traditional compression techniques, such as dictionary-based and statistical methods, often struggle to optimally exploit the structure and redundancy in complex data formats. Recent advancements in deep learning have opened new avenues for compression; however, many existing approaches depend on dense vector representations that obscure the underlying token structure. To address these limitations, we propose a novel lossless compression method that leverages Reinforcement Learning applied to a T5 language model architecture. This approach enables the compression of data into sequences of tokens rather than traditional vector representations. Unlike auto-encoders, which typically encode information into continuous latent spaces, our method preserves the token-based structure, aligning more closely with the original data format. This preservation allows for higher compression ratios while maintaining semantic integrity. By training the model using an off-policy Reinforcement Learning algorithm, we optimize sequence length to minimize redundancy and enhance compression efficiency. Our method introduces an efficient and adaptive data compression system built upon advanced Reinforcement Learning techniques, functioning independently of external grammatical or world knowledge. This approach shows significant improvements in compression ratios compared to conventional methods. By leveraging the latent information within language models, our system effectively compresses data without requiring explicit content understanding, paving the way for more robust and practical compression solutions across various applications.

Paper Structure (16 sections, 5 equations, 6 figures, 1 table, 1 algorithm)


Figures (6)

  • Figure 1: Structural overview of the compression network. The encoder processes an input sequence of length $S$, generating embeddings of size $S \times E$. The decoder reconstructs the original input from the compressed sequence of length $L$, using both the generated representation and learned embeddings. The value head estimates compression quality, while the policy head predicts the next token.
  • Figure 2: A2C implementation. The language modeling head is used as the actor to choose tokens. The value head guides the actor by estimating the value.
  • Figure 3: A conceptual view of our architecture. In this diagram, the gray boxes represent Seq2Seq models, which automatically generate a uniform token representation for both the compressor and the decompressor. The left box runs the A2C algorithm to compress the input data, and the right box tries to reproduce the original input from the compressed data.
  • Figure 4: Compression ratio comparison across different chunk sizes.
  • Figure 5: Compression latency vs. chunk size. Time is measured for a single batch.
  • ...and 1 more figure
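
The actor and value heads described in the Figure 2 caption can be illustrated with a textbook advantage actor-critic (A2C) loss computation. This is a minimal sketch of the generic technique, not the paper's implementation: the function name `a2c_losses`, the per-token reward layout, and the discount factor `gamma` are all assumptions, and plain Python lists stand in for model outputs.

```python
def a2c_losses(log_probs, values, rewards, gamma=1.0):
    """Generic A2C losses for one sampled token sequence.

    log_probs: log-probability of each token chosen by the actor (LM head).
    values:    value-head estimate at each step.
    rewards:   reward received at each step (hypothetical layout; often zero
               until the final step).
    """
    # Discounted returns, accumulated from the last step backwards.
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()

    # Advantage: how much better the observed return was than the estimate.
    advantages = [g - v for g, v in zip(returns, values)]

    # Actor loss: policy gradient weighted by the advantage. In an autograd
    # framework the advantage would be treated as a constant here.
    actor_loss = -sum(lp * a for lp, a in zip(log_probs, advantages)) / len(log_probs)

    # Critic loss: mean squared error between value estimates and returns.
    critic_loss = sum(a * a for a in advantages) / len(advantages)
    return actor_loss, critic_loss, advantages


if __name__ == "__main__":
    actor, critic, adv = a2c_losses(
        log_probs=[-1.0, -2.0], values=[0.0, 0.0], rewards=[0.0, -3.0]
    )
    print(actor, critic, adv)
```

In this formulation the value head only shapes the learning signal; token selection itself is left entirely to the actor, which matches the division of roles in the caption.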