An Enhanced Text Compression Approach Using Transformer-based Language Models

Chowdhury Mofizur Rahman, Mahbub E Sobhani, Anika Tasnim Rodela, Swakkhar Shatabda

TL;DR

This work tackles efficient, lossless text compression by combining a novel pre-processing step with transformer-based text restoration. The authors introduce RejuvenateFormer, which applies vowel removal and LZW-based compression and then uses a six-layer encoder–decoder to recover the original text. It achieves state-of-the-art compression ratios on BookCorpus, EN-DE, and EN-FR, and competitive BLEU scores on restoration. Key contributions include the LZW-driven pre-processing, the six-layer transformer architecture, and ablations showing the impact of corpus size. The approach offers practical gains in storage and bandwidth for large-scale text data and demonstrates a viable path toward scalable, lossless transformer-based decompression.

Abstract

Text compression shrinks textual data while keeping crucial information, easing constraints on storage, bandwidth, and computational efficiency. The integration of lossless compression techniques with transformer-based text decompression has received negligible attention, despite the increasing volume of English text data in communication. The primary barrier to advancing text compression and restoration is optimizing transformer-based approaches with efficient pre-processing and integrating lossless compression algorithms, which remained unresolved in prior attempts. Here, we propose a transformer-based method named RejuvenateFormer for text decompression, addressing prior issues by harnessing a new pre-processing technique and a lossless compression method. Our meticulous pre-processing technique, which incorporates the Lempel-Ziv-Welch algorithm, achieves compression ratios of 12.57, 13.38, and 11.42 on the BookCorpus, EN-DE, and EN-FR corpora, respectively, showing state-of-the-art compression ratios compared to other deep learning and traditional approaches. Furthermore, RejuvenateFormer achieves BLEU scores of 27.31, 25.78, and 50.45 on the EN-DE, EN-FR, and BookCorpus corpora, respectively, showcasing its comprehensive efficacy. In contrast, the pre-trained T5-Small exhibits better performance than prior state-of-the-art models.
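The pre-processing described above (vowel removal followed by Lempel-Ziv-Welch coding) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`remove_vowels`, `lzw_compress`, `lzw_decompress`, `compression_ratio`), the ASCII-only vowel set, the byte-level dictionary, and the fixed-width bit accounting used for the ratio are all assumptions made for illustration, so the numbers it produces are not comparable to the ratios reported in the abstract.

```python
VOWELS = frozenset("aeiouAEIOU")  # assumption: ASCII vowels only


def remove_vowels(text: str) -> str:
    """Drop vowels; spaces, punctuation and consonants are kept."""
    return "".join(ch for ch in text if ch not in VOWELS)


def lzw_compress(text: str) -> list[int]:
    """Textbook LZW over UTF-8 bytes; returns a list of integer codes."""
    table = {bytes([i]): i for i in range(256)}
    next_code = 256
    current = b""
    codes = []
    for byte in text.encode("utf-8"):
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate  # keep extending the current match
        else:
            codes.append(table[current])
            table[candidate] = next_code  # learn the new phrase
            next_code += 1
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes: list[int]) -> str:
    """Invert lzw_compress, rebuilding the same table on the fly."""
    if not codes:
        return ""
    table = {i: bytes([i]) for i in range(256)}
    next_code = 256
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == next_code:
            # Code refers to the phrase that is being defined right now.
            entry = previous + previous[:1]
        else:
            raise ValueError(f"invalid LZW code: {code}")
        out.append(entry)
        table[next_code] = previous + entry[:1]
        next_code += 1
        previous = entry
    return b"".join(out).decode("utf-8")


def compression_ratio(original: str, codes: list[int]) -> float:
    """Original bits / compressed bits, assuming fixed-width codes.

    Assumption: every code is stored with the width needed for the
    largest code (at least 9 bits). The paper may define it differently.
    """
    if not codes:
        raise ValueError("cannot compute a ratio for empty output")
    original_bits = len(original.encode("utf-8")) * 8
    code_width = max(9, max(codes).bit_length())
    return original_bits / (len(codes) * code_width)


sample = "the quick brown fox jumps over the lazy dog " * 50
stripped = remove_vowels(sample)
codes = lzw_compress(stripped)
print(f"ratio vs. original text: {compression_ratio(sample, codes):.2f}")
```

Note that only the LZW stage is invertible in this sketch: `lzw_decompress` returns the vowel-less text, and restoring the vowels is the task the abstract assigns to the transformer.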

Paper Structure

This paper contains 22 sections, 3 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: (Top) Each corpus undergoes vowel removal, and the compression ratio is calculated from the compressed representation. The text is then reverted to its earlier, vowel-free form. (Bottom) After tokenization, RejuvenateFormer is trained on each corpus to generate the expected output.