Transformers from Compressed Representations

Juan C. Leon Alcazar, Mattia Soldan, Mohammad Saatialsoruji, Alejandro Pardo, Hani Itani, Juan Camilo Perez, Bernard Ghanem

TL;DR

TEMPEST tackles the challenge of long transformer input sequences for multimedia data by tokenizing compressed file formats (CFFs) at the block level rather than operating on raw bytes. It learns compact block embeddings with an intra-block encoder and then performs semantic learning with a ViT-like classifier on the sequence of block embeddings, jointly optimizing a reconstruction loss and a classification loss. Across audio (MP3, Opus) and image (JPEG) modalities, TEMPEST achieves competitive accuracy with substantial reductions in token counts and attention-related memory and compute, and it benefits from multi-bit-rate data augmentation during training and inference. Because it exploits the inherent structure of CFFs, the approach enables efficient, scalable semantic modelling without full decompression, with practical impact for large-scale media understanding and retrieval tasks.
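The joint objective mentioned above can be illustrated with a minimal sketch. This summary does not say how the two terms are combined or which reconstruction loss is used, so the code assumes a weighted sum, a mean-squared-error reconstruction term, and a hypothetical weight `recon_weight`; all function names are illustrative and not taken from the paper.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Softmax cross-entropy for one example, computed stably via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def mse(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Mean squared error between a reconstructed block and the original block."""
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred)


def joint_loss(logits, target, recon, block, recon_weight=1.0):
    """Assumed form: classification loss + recon_weight * reconstruction loss.

    `recon_weight` is a hypothetical hyperparameter, not a value from the paper.
    """
    return cross_entropy(logits, target) + recon_weight * mse(recon, block)


# Toy values: two classes, one two-element block.
loss = joint_loss(
    logits=[2.0, 0.0], target=0,
    recon=[0.5, 0.5], block=[1.0, 0.0],
    recon_weight=0.5,
)
```

In this reading, the reconstruction term acts as a regularizer on the block embeddings while the classification term drives the semantic task.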

Abstract

Compressed file formats are the cornerstone of efficient data storage and transmission, yet their potential for representation learning remains largely underexplored. We introduce TEMPEST (TransformErs froM comPressed rEpreSenTations), a method that exploits the inherent byte-stream structure of compressed files to design an effective tokenization and encoding strategy. By leveraging this compact encoding, a standard transformer can directly learn semantic representations from compressed data streams, bypassing the need for raw byte-level processing or full media decoding. Our proposal substantially reduces the number of tokens required for semantic classification, thereby lowering both computational complexity and memory usage. Through extensive experiments across diverse datasets, coding schemes, and modalities, we show that TEMPEST achieves accuracy competitive with the state of the art while delivering efficiency gains in memory and compute.
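To see why fewer tokens lowers memory and compute, recall that standard self-attention builds an n × n score matrix for n tokens. The sketch below compares the size of that matrix for byte-level and block-level tokenization. The file size and block size are hypothetical round numbers chosen for illustration, not figures from the paper, and the calculation ignores the cost of embedding each block.

```python
def attention_matrix_entries(num_tokens: int) -> int:
    """Number of entries in one self-attention score matrix (n x n)."""
    return num_tokens * num_tokens


def reduction_factor(num_bytes: int, bytes_per_block: int) -> float:
    """Ratio of byte-level to block-level attention matrix sizes.

    Assumes one token per byte in the baseline and one token per block
    otherwise; both inputs are hypothetical.
    """
    num_blocks = -(-num_bytes // bytes_per_block)  # ceiling division
    return attention_matrix_entries(num_bytes) / attention_matrix_entries(num_blocks)


# Hypothetical example: a 40,000-byte file split into 400-byte blocks
# gives 100 block tokens instead of 40,000 byte tokens.
factor = reduction_factor(num_bytes=40_000, bytes_per_block=400)
```

Because the cost is quadratic in sequence length, cutting the token count by a factor of k shrinks the attention matrix by roughly k squared.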

Paper Structure

This paper contains 23 sections, 5 equations, 1 figure, 8 tables.

Figures (1)

  • Figure 1: TEMPEST Architecture. TEMPEST consists of three sub-networks: the block embedding network (green), the classification network (blue), and the block reconstruction network (orange). The input to TEMPEST is a compressed byte stream (gray and red), which is split into sub-components (compressed data blocks) according to the special byte markers defined by the compressed file format (CFF); the byte values shown in red are for illustration only. Each compressed block is mapped to an embedding (purple), whose representation is regularized by the reconstruction network (orange). The classification network is a ViT-like architecture: it prepends a [CLS] token and produces the final classification from the sequence of embedded blocks.
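The block-splitting step in the caption can be sketched as follows. This is a simplified illustration, not the authors' tokenizer: the marker bytes, the toy stream, and the helper names `split_blocks` and `pad_blocks` are all hypothetical, and a real format needs header parsing because marker-like byte patterns can also occur by chance inside the payload.

```python
from typing import List


def split_blocks(stream: bytes, marker: bytes) -> List[bytes]:
    """Split `stream` into blocks, each beginning at an occurrence of `marker`."""
    if not marker:
        raise ValueError("marker must be non-empty")
    starts = []
    pos = stream.find(marker)
    while pos != -1:
        starts.append(pos)
        pos = stream.find(marker, pos + len(marker))
    # Bytes before the first marker (e.g. a file header) are dropped in this sketch.
    ends = starts[1:] + [len(stream)]
    return [stream[s:e] for s, e in zip(starts, ends)]


def pad_blocks(blocks: List[bytes], block_len: int, pad_value: int = 0) -> List[List[int]]:
    """Truncate or right-pad every block to exactly `block_len` byte values."""
    padded = []
    for block in blocks:
        ids = list(block[:block_len])
        ids += [pad_value] * (block_len - len(ids))
        padded.append(ids)
    return padded


# Toy stream: two header bytes followed by three blocks that each start
# with an illustrative two-byte marker.
toy_marker = bytes([0xFF, 0xFB])
toy_stream = bytes([0x00, 0x01,
                    0xFF, 0xFB, 0x10, 0x20, 0x30,
                    0xFF, 0xFB, 0x40,
                    0xFF, 0xFB, 0x50, 0x60])

blocks = split_blocks(toy_stream, toy_marker)
tokens = pad_blocks(blocks, block_len=6)
```

Each fixed-length row in `tokens` would then be passed to a block embedding network, and the resulting sequence of embeddings, with a [CLS] token prepended, would form the classifier input.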