Learned Compression for Compressed Learning

Dan Jacobellis, Neeraja J. Yadwadkar

TL;DR

WaLLoC addresses the cost of learning on high-resolution data by enabling compressed-domain learning and inference. It places a shallow, asymmetric autoencoder (a nearly linear encoder paired with a nonlinear decoder) and an entropy bottleneck between an invertible wavelet packet transform and its inverse. This design yields efficient encoding, high compression ratios, and uniform dimensionality reduction without relying on perceptual or adversarial losses. Empirical results show gains over existing autoencoder-based codecs for RGB images and stereo audio, along with improvements in downstream tasks such as image classification, colorization, document understanding, and music source separation, all while keeping encoding overhead low. Because it does not depend on modality-specific losses and its encoder is almost entirely linear, the approach is well suited for mobile computing, remote sensing, and learning directly from compressed data; code and models are publicly available.

Abstract

Modern sensors produce increasingly rich streams of high-resolution data. Due to resource constraints, machine learning systems discard the vast majority of this information via resolution reduction. Compressed-domain learning lets models operate on compact latent representations, enabling higher effective resolution for the same budget. However, existing compression systems are not ideal for compressed learning. Linear transform coding and end-to-end learned compression systems reduce bitrate, but do not uniformly reduce dimensionality; thus, they do not meaningfully increase efficiency. Generative autoencoders reduce dimensionality, but their adversarial or perceptual objectives lead to significant information loss. To address these limitations, we introduce WaLLoC (Wavelet Learned Lossy Compression), a neural codec architecture that combines linear transform coding with nonlinear dimensionality-reducing autoencoders. WaLLoC sandwiches a shallow, asymmetric autoencoder and entropy bottleneck between an invertible wavelet packet transform and its inverse. Across several key metrics, WaLLoC outperforms the autoencoders used in state-of-the-art latent diffusion models. WaLLoC does not require perceptual or adversarial losses to represent high-frequency detail, providing compatibility with modalities beyond RGB images and stereo audio. WaLLoC's encoder consists almost entirely of linear operations, making it exceptionally efficient and suitable for mobile computing, remote sensing, and learning directly from compressed data. We demonstrate WaLLoC's capability for compressed-domain learning across several tasks, including image classification, colorization, document understanding, and music source separation. Our code, experiments, and pre-trained audio and image codecs are available at https://ut-sysml.org/walloc
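
The distinction the abstract draws between an invertible transform (which only rearranges information) and uniform dimensionality reduction can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the Haar filters, the values of `J` and `latent_channels`, and the random matrix `W_enc`, which merely stands in for a learned linear encoder and is not the released WaLLoC model.

```python
import numpy as np

def haar_wpt2d_level(x):
    """One 2D Haar wavelet-packet level on x of shape (C, H, W), H and W even.

    Returns an array of shape (4*C, H//2, W//2): the element count is
    unchanged, so this step alone does not reduce dimensionality.
    """
    a, b = x[:, 0::2, 0::2], x[:, 0::2, 1::2]
    c, d = x[:, 1::2, 0::2], x[:, 1::2, 1::2]
    ll = (a + b + c + d) / 2.0  # low-pass along both axes
    lh = (a - b + c - d) / 2.0  # high-pass along width
    hl = (a + b - c - d) / 2.0  # high-pass along height
    hh = (a - b - c + d) / 2.0  # high-pass along both axes
    return np.concatenate([ll, lh, hl, hh], axis=0)

def analysis(x, J, W_enc):
    """J wavelet-packet levels followed by a linear channel projection.

    W_enc has shape (latent_channels, C * 4**J). It is a hypothetical
    stand-in for a learned encoder, used only to show the shapes involved.
    """
    for _ in range(J):
        x = haar_wpt2d_level(x)
    # Mix channels at every spatial position; this is where dimensionality drops.
    return np.einsum("lc,chw->lhw", W_enc, x)

rng = np.random.default_rng(0)
image = rng.standard_normal((3, 256, 256))   # stand-in for an RGB image
J, latent_channels = 3, 12                   # illustrative values only
W_enc = rng.standard_normal((latent_channels, 3 * 4**J))

z = analysis(image, J, W_enc)
reduction = image.size / z.size              # same factor for every input
print(z.shape, reduction)
```

In this sketch the wavelet packet stage turns a `(3, 256, 256)` array into `(192, 32, 32)` without changing the number of elements, and the projection to 12 channels then shrinks every input by the same factor regardless of content, which is the sense of "uniform" used above.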

Paper Structure

This paper contains 22 sections, 11 figures, 3 tables.

Figures (11)

  • Figure 1: In discriminative models (left), resolution reduction increases training and inference efficiency, but significantly degrades accuracy. Replacing resolution reduction with WaLLoC leads to significantly higher accuracy, while providing the same degree of acceleration. For signal enhancement (right), WaLLoC provides better quality when scaling to high resolutions compared to directly operating on image pixels or audio samples.
  • Figure 2: Comparison of our proposed method (WaLLoC) with other autoencoder designs for RGB images (Cheng2020 \cite{cheng2020learned}, Stable Diffusion 3 \cite{esser2024scaling}) and stereo audio (EnCodec \cite{defossez2022high}, Stable Audio \cite{evans2024stable}). Additional metrics are reported in Tables \ref{tab:RGB} and \ref{tab:stereo}.
  • Figure 3: WaLLoC's encode-decode pipeline. The entropy bottleneck and entropy coding steps are only required to achieve high compression ratios for storage and transmission. For compressed-domain learning where dimensionality reduction is the primary goal, these steps can be skipped to reduce overhead and completely eliminate CPU-GPU transfers.
  • Figure 4: Example of forward and inverse WPT with $J=2$ levels. Each level applies the analysis filters $\text{L}_{\text{A}}$ and $\text{H}_{\text{A}}$ independently to each signal channel, followed by downsampling by a factor of two $\left(\downarrow 2\right)$. An inverse level upsamples by a factor of two $\left(\uparrow 2\right)$, applies the synthesis filters $\text{L}_{\text{S}}$ and $\text{H}_{\text{S}}$, and sums the two resulting channels. The full WPT $\tilde{\mathbf{X}}$ consists of $J$ levels.
  • Figure 5: Cheng et al. 2020 \cite{cheng2020learned}
  • ...and 6 more figures
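
The forward and inverse wavelet packet levels described in the Figure 4 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: Haar filters are assumed for the low-pass and high-pass pairs (the filters actually used are not given in this excerpt), and the function names `wpt_level`, `iwpt_level`, `wpt`, and `iwpt` are hypothetical.

```python
import numpy as np

def wpt_level(x):
    """One analysis level on x of shape (channels, length), length even.

    With Haar filters (an assumption), low-pass/high-pass filtering followed
    by downsampling by two reduces to scaled sums and differences of
    neighbouring samples. Every channel is split, including earlier
    high-pass bands, which is what makes this a wavelet *packet* transform.
    """
    even, odd = x[:, 0::2], x[:, 1::2]
    low = (even + odd) / np.sqrt(2.0)
    high = (even - odd) / np.sqrt(2.0)
    return np.concatenate([low, high], axis=0)   # (2*channels, length//2)

def iwpt_level(y):
    """One synthesis level: upsample by two, filter, and sum the two bands."""
    c = y.shape[0] // 2
    low, high = y[:c], y[c:]
    x = np.empty((c, 2 * y.shape[1]), dtype=y.dtype)
    x[:, 0::2] = (low + high) / np.sqrt(2.0)
    x[:, 1::2] = (low - high) / np.sqrt(2.0)
    return x

def wpt(x, J):
    """Full forward transform: J analysis levels."""
    for _ in range(J):
        x = wpt_level(x)
    return x

def iwpt(y, J):
    """Full inverse transform: J synthesis levels."""
    for _ in range(J):
        y = iwpt_level(y)
    return y

rng = np.random.default_rng(0)
signal = rng.standard_normal((2, 16))    # stand-in for a short stereo clip
coeffs = wpt(signal, J=2)                # shape (8, 4): 4x channels, 1/4 length
restored = iwpt(coeffs, J=2)
print(coeffs.shape, np.allclose(restored, signal))
```

With $J=2$ each input channel becomes four subband channels at one quarter of the original length, and applying the inverse levels recovers the input, which is the invertibility the captions rely on.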