Transformers from Compressed Representations
Juan C. Leon Alcazar, Mattia Soldan, Mohammad Saatialsoruji, Alejandro Pardo, Hani Itani, Juan Camilo Perez, Bernard Ghanem
TL;DR
TEMPEST tackles the challenge of long transformer input sequences for multimedia data by tokenizing compressed file formats (CFFs) at the block level rather than operating on raw bytes. An intra-block encoder learns a compact embedding for each block, and a ViT-like classifier then performs semantic learning on the sequence of block embeddings; the two are trained jointly with a reconstruction loss and a classification loss. Across audio (MP3, Opus) and image (JPEG) modalities, TEMPEST achieves competitive accuracy with substantial reductions in token counts and attention-related memory/compute, and benefits from multi-bit-rate data augmentation during training and inference. This approach leverages the inherent structure of CFFs to enable efficient, scalable semantic modelling without full decompression, with practical impact for large-scale media understanding and retrieval tasks.
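To make the idea of block-level tokenization concrete, here is a minimal sketch of grouping a byte stream into blocks so that each block, rather than each byte, becomes one transformer token. It is an illustration only: the function name `split_into_blocks`, the fixed block size of 128 bytes, and the zero padding are all assumptions made here, whereas TEMPEST derives its blocks from the structure of the compressed format itself. The learned intra-block encoder that would turn each block into an embedding is not shown.

```python
from typing import List


def split_into_blocks(stream: bytes, block_size: int, pad_byte: int = 0) -> List[bytes]:
    """Split a byte stream into fixed-size blocks, padding the last one.

    Hypothetical helper: fixed-size blocks are an assumption for this sketch,
    not the block definition used by any particular codec or by the paper.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    blocks = []
    for start in range(0, len(stream), block_size):
        block = stream[start:start + block_size]
        if len(block) < block_size:
            # Pad the final, shorter block so every token has the same length.
            block = block + bytes([pad_byte]) * (block_size - len(block))
        blocks.append(block)
    return blocks


# Synthetic stand-in for a compressed file (not real MP3/Opus/JPEG data).
stream = bytes(range(256)) * 4 + b"tail"  # 1028 bytes
blocks = split_into_blocks(stream, block_size=128)

print(f"byte-level tokens:  {len(stream)}")
print(f"block-level tokens: {len(blocks)}")
```

In a full pipeline, each element of `blocks` would be passed through the intra-block encoder, and the resulting sequence of embeddings would be the classifier's input.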
Abstract
Compressed file formats are the cornerstone of efficient data storage and transmission, yet their potential for representation learning remains largely underexplored. We introduce TEMPEST (TransformErs froM comPressed rEpreSenTations), a method that exploits the inherent byte-stream structure of compressed files to design an effective tokenization and encoding strategy. By leveraging this compact encoding, a standard transformer can directly learn semantic representations from compressed data streams, bypassing the need for raw byte-level processing or full media decoding. Our proposal substantially reduces the number of tokens required for semantic classification, thereby lowering both computational complexity and memory usage. Through extensive experiments across diverse datasets, coding schemes, and modalities, we show that TEMPEST achieves accuracy competitive with the state of the art while delivering efficiency gains in memory and compute.
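The link between fewer tokens and lower attention cost can be checked with simple arithmetic: standard self-attention builds an n-by-n score matrix per head and layer, so shrinking the sequence by a factor k shrinks that matrix by k squared. The sketch below works through one example. The file size (64,000 bytes) and block size (128 bytes) are hypothetical values chosen for illustration and are not figures reported for TEMPEST; the cost of encoding each block is also left out.

```python
import math


def token_counts(n_bytes: int, block_size: int) -> tuple:
    """Return (byte-level tokens, block-level tokens) for a stream of n_bytes."""
    byte_tokens = n_bytes  # one token per byte
    block_tokens = math.ceil(n_bytes / block_size)  # one token per block
    return byte_tokens, block_tokens


def attention_entries(n_tokens: int) -> int:
    """Entries in one self-attention score matrix (per head, per layer)."""
    return n_tokens * n_tokens


# Hypothetical sizes, chosen only to make the arithmetic easy to follow.
byte_tokens, block_tokens = token_counts(n_bytes=64_000, block_size=128)
reduction = attention_entries(byte_tokens) / attention_entries(block_tokens)

print(f"byte-level tokens:  {byte_tokens}")
print(f"block-level tokens: {block_tokens}")
print(f"attention matrix is {reduction:.0f}x smaller at block level")
```

Under these assumed sizes the sequence is 128 times shorter, so the attention score matrix has 128 squared, or 16,384, times fewer entries.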
