SMIC: Semantic Multi-Item Compression based on CLIP dictionary

Tom Bachard, Thomas Maugey

TL;DR

SMIC addresses semantic compression for large image collections by exploiting inter-item semantic redundancy. It leverages the linearity of CLIP's latent space to perform semantic vector arithmetic and learns a semantic latent dictionary that expresses each image as a sparse combination of atoms. This enables a two-stage pipeline: dictionary transmission, then projection-based latent reconstruction. The learned dictionary captures high-level concepts and allows semantically faithful images to be generated at an ultra-low bitrate, around $10^{-5}$ bits per pixel (BPP) per image in experiments, outperforming state-of-the-art single-item codecs. This approach enables efficient semantic-aware storage for data collections and opens paths toward semantic quantization and broader use with other foundation models.

Abstract

Semantic compression, a compression scheme where the distortion metric, typically MSE, is replaced with semantic fidelity metrics, is becoming increasingly popular. Most recent semantic compression schemes rely on the foundation model CLIP. In this work, we extend such a scheme to image collection compression, where inter-item redundancy is taken into account during the coding phase. For that purpose, we first show that CLIP's latent space allows for easy semantic additions and subtractions. From this property, we define a dictionary-based multi-item codec that outperforms state-of-the-art generative codecs in terms of compression rate, reaching around $10^{-5}$ BPP per image without sacrificing semantic fidelity. We also show that the learned dictionary is of a semantic nature and works as a semantic projector for the semantic content of images.
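
The dictionary idea in the abstract can be illustrated with a minimal sketch: a latent vector is approximated by a few atoms of a dictionary, so that only the atom indices and coefficients need to be stored per image. Everything here is an assumption made for illustration: random vectors stand in for CLIP latents, the dictionary is random rather than learned, and orthogonal matching pursuit is a generic sparse-coding routine that is not claimed to be the paper's projection method.

```python
import numpy as np

def omp(D, z, k):
    """Greedy orthogonal matching pursuit (illustrative, not the paper's method).

    D: (n_atoms, dim) array of unit-norm atoms; z: (dim,) latent vector.
    Returns the indices and coefficients of at most k selected atoms.
    """
    residual = z.copy()
    idx = []
    coef = np.zeros(0)
    for _ in range(k):
        # Pick the atom most correlated with the part of z not yet explained.
        scores = np.abs(D @ residual)
        if idx:
            scores[idx] = -np.inf  # never select the same atom twice
        idx.append(int(np.argmax(scores)))
        # Refit all selected coefficients jointly by least squares.
        A = D[idx].T  # shape (dim, number of selected atoms)
        coef, *_ = np.linalg.lstsq(A, z, rcond=None)
        residual = z - A @ coef
    return idx, coef

def reconstruct(D, idx, coef):
    """Rebuild the latent from the selected atoms and their coefficients."""
    return D[idx].T @ coef

rng = np.random.default_rng(0)
dim, n_atoms, k = 64, 32, 3  # hypothetical sizes, chosen only for the demo

# Hypothetical dictionary: random unit-norm atoms instead of learned ones.
D = rng.standard_normal((n_atoms, dim))
D /= np.linalg.norm(D, axis=1, keepdims=True)

# Synthetic "latent" built from three atoms, standing in for a CLIP embedding.
z = D[[2, 11, 27]].T @ np.array([1.5, -0.8, 0.6])

idx, coef = omp(D, z, k)
z_hat = reconstruct(D, idx, coef)
err = np.linalg.norm(z - z_hat)
print("atoms:", idx, "reconstruction error:", err)
```

In this sketch the decoder would need the dictionary once for the whole collection, plus `k` indices and `k` coefficients per image, which mirrors the two-stage pipeline described above.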

Paper Structure

This paper contains 18 sections, 15 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: Multi-item compression with individual image coding. $\mathcal{I}_{\mathcal{X}}$ describes the database's statistics used for individual encoding and decoding.
  • Figure 2: Generative compression. The generated images are evaluated in terms of semantic fidelity and visual quality.
  • Figure 3: Semantic Multi-item compression. $\mathcal{I}_{\mathcal{X}}$ describes the database's statistics used for individual encoding and decoding.
  • Figure 4: Progressively adding people to the landscape from Kodak. (Left) Input images $\mathbf{x}_1$ and $\mathbf{x}_2$. (Right) Top to bottom, left to right: images generated from $f(\mathbf{x}_1)+\alpha f(\mathbf{x}_2)$, where $\alpha=i/4,\ i\in\{1,\dots,8\}$.
  • Figure 5: Progressively removing the river from the landscape from Kodak. (Left) Input images $\mathbf{x}_1$ and $\mathbf{x}_2$. (Right) Left to right: images generated from $f(\mathbf{x}_1)-\alpha f(\mathbf{x}_2)$, where $\alpha=i/8,\ i\in\{1,\dots,8\}$.
  • ...and 9 more figures
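
The latent arithmetic in the captions of Figures 4 and 5 can be sketched as follows. This is an illustration under stated assumptions: random vectors stand in for the encoder outputs $f(\mathbf{x}_1)$ and $f(\mathbf{x}_2)$, the helper `combine` is hypothetical, and the generative decoder that turns each combined latent into an image is not shown.

```python
import numpy as np

def combine(z1, z2, alpha, sign=1.0):
    """Return z1 + sign * alpha * z2, the combination used in the captions."""
    return z1 + sign * alpha * z2

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
dim = 512  # hypothetical latent size, chosen only for the demo

# Stand-ins for f(x_1) and f(x_2); real values would come from the encoder.
z1 = rng.standard_normal(dim)
z2 = rng.standard_normal(dim)

# Figure 4 sweep: alpha = i/4 for i = 1..8, adding the second latent.
alphas_add = [i / 4 for i in range(1, 9)]
sims_add = [cosine(combine(z1, z2, a, +1.0), z2) for a in alphas_add]

# Figure 5 sweep: alpha = i/8 for i = 1..8, subtracting the second latent.
alphas_sub = [i / 8 for i in range(1, 9)]
sims_sub = [cosine(combine(z1, z2, a, -1.0), z2) for a in alphas_sub]

for a, s in zip(alphas_add, sims_add):
    print(f"add  alpha={a:.3f}  cos(z, z2)={s:+.3f}")
for a, s in zip(alphas_sub, sims_sub):
    print(f"sub  alpha={a:.3f}  cos(z, z2)={s:+.3f}")
```

The similarity to the second latent rises along the addition sweep and falls along the subtraction sweep, which is the vector-space counterpart of content being progressively added to or removed from the generated images.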