SMIC: Semantic Multi-Item Compression based on CLIP dictionary
Tom Bachard, Thomas Maugey
TL;DR
SMIC addresses semantic compression for large image collections by exploiting inter-item semantic redundancy. It leverages the linearity of CLIP's latent space to perform semantic vector arithmetic and learns a semantic latent dictionary that expresses each image as a sparse combination of atoms. This enables a two-stage pipeline: dictionary transmission, followed by projection-based latent reconstruction. The learned dictionary captures high-level concepts and allows semantically faithful images to be generated at ultra-low bitrates, around $10^{-5}$ bits per pixel (BPP) per image in experiments, outperforming state-of-the-art single-item codecs. This approach enables efficient semantic-aware storage for data collections and opens paths toward semantic quantization and broader use with other foundation models.
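To make the idea of "a sparse combination of atoms" concrete, here is a minimal sketch assuming a plain greedy matching pursuit over a dictionary with unit-norm atoms. The summary does not say which sparse solver SMIC uses, so matching pursuit is a stand-in. The function names and the toy dictionary are hypothetical. The bit budget (`coef_bits`, fixed-length atom indices) is also an illustrative assumption, not the paper's rate model.

```python
import math
import numpy as np

def matching_pursuit(z, D, k):
    """Greedily approximate latent z with k atoms (rows of D, assumed unit norm)."""
    residual = np.asarray(z, dtype=float).copy()
    indices, coefs = [], []
    for _ in range(k):
        scores = D @ residual                  # correlation with every atom
        j = int(np.argmax(np.abs(scores)))     # best-matching atom
        indices.append(j)
        coefs.append(float(scores[j]))
        residual = residual - scores[j] * D[j] # remove its contribution
    return indices, coefs

def reconstruct(indices, coefs, D):
    """Decoder side: rebuild the latent from (atom index, coefficient) pairs."""
    return sum(c * D[j] for j, c in zip(indices, coefs))

def bits_per_pixel(k, n_atoms, coef_bits, height, width):
    """Per-image rate under an assumed fixed-length code (dictionary cost excluded)."""
    index_bits = math.ceil(math.log2(n_atoms))
    return k * (index_bits + coef_bits) / (height * width)

# Toy example: a 4-atom orthonormal dictionary and a latent built from two atoms.
D = np.eye(4)
z = np.array([3.0, 0.0, -2.0, 0.0])
indices, coefs = matching_pursuit(z, D, k=2)
z_hat = reconstruct(indices, coefs, D)
```

Only the atom indices and coefficients need to be stored per image. The dictionary is sent once and its cost is shared across the whole collection, which is where the multi-item saving comes from.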
Abstract
Semantic compression, a compression scheme in which the usual distortion metric (typically MSE) is replaced with semantic fidelity metrics, is becoming increasingly popular. Most recent semantic compression schemes rely on the foundation model CLIP. In this work, we extend such a scheme to image collection compression, where inter-item redundancy is taken into account during the coding phase. To that end, we first show that CLIP's latent space allows for easy semantic additions and subtractions. Building on this property, we define a dictionary-based multi-item codec that outperforms state-of-the-art generative codecs in terms of compression rate, reaching around $10^{-5}$ BPP per image, without sacrificing semantic fidelity. We also show that the learned dictionary is semantic in nature and acts as a projector for the semantic content of images.
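The "semantic additions and subtractions" mentioned above can be illustrated with a small sketch. It assumes the embeddings behave linearly and are compared by cosine similarity. The vectors below are hand-made toy stand-ins, not real CLIP outputs, and the helper names are hypothetical.

```python
import numpy as np

def normalize(v):
    """Scale vectors to unit L2 norm along the last axis."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def semantic_arithmetic(plus, minus):
    """Add the `plus` embeddings, subtract the `minus` ones, then renormalize."""
    total = np.sum(plus, axis=0) - np.sum(minus, axis=0)
    return normalize(total)

def nearest(query, bank):
    """Index of the bank entry with the highest cosine similarity to query."""
    sims = normalize(bank) @ normalize(query)
    return int(np.argmax(sims))

# Toy concept axes (assumption): dimension 0 = dog, 1 = grass, 2 = beach.
dog_on_grass = np.array([1.0, 1.0, 0.0])
dog_on_beach = np.array([1.0, 0.0, 1.0])
grass = np.array([0.0, 1.0, 0.0])
beach = np.array([0.0, 0.0, 1.0])

# "dog on grass" - "grass" + "beach" should land near "dog on beach".
query = semantic_arithmetic(plus=[dog_on_grass, beach], minus=[grass])
bank = np.stack([dog_on_grass, dog_on_beach, grass])
match = nearest(query, bank)
```

With real embeddings the match would be approximate rather than exact, so the nearest-neighbour lookup stands in for whatever fidelity check is actually used.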
