VLIC: Vision-Language Models As Perceptual Judges for Human-Aligned Image Compression

Kyle Sargent, Ruiqi Gao, Philipp Henzler, Charles Herrmann, Aleksander Holynski, Li Fei-Fei, Jiajun Wu, Jason Zhang

TL;DR

This work tackles the misalignment between traditional distortion measures and human perceptual quality in image compression. It introduces VLIC, a diffusion-based compressor post-trained with Vision-Language Model judgments (via Diffusion DPO), leveraging zero-shot VLM reasoning to guide reconstructions toward human-perceived similarity. Across MS-COCO and CLIC benchmarks, VLIC achieves competitive or state-of-the-art perceptual performance, with strong gains when VLM rewards are ensembled with LPIPS, and thorough analyses of reward design and noise mitigation. The study demonstrates that zero-shot perceptual priors from VLMs can effectively steer compression toward human-aligned quality, offering a scalable avenue as VLMs continue to improve, while also outlining limitations and practical considerations like latency and reward noise.

Abstract

Evaluations of image compression performance which include human preferences have generally found that naive distortion functions such as MSE are insufficiently aligned to human perception. In order to align compression models to human perception, prior work has employed differentiable perceptual losses consisting of neural networks calibrated on large-scale datasets of human psycho-visual judgments. We show that, surprisingly, state-of-the-art vision-language models (VLMs) can replicate binary human two-alternative forced choice (2AFC) judgments zero-shot when asked to reason about the differences between pairs of images. Motivated to exploit the powerful zero-shot visual reasoning capabilities of VLMs, we propose Vision-Language Models for Image Compression (VLIC), a diffusion-based image compression system designed to be post-trained with binary VLM judgments. VLIC leverages existing techniques for diffusion model post-training with preferences, rather than distilling the VLM judgments into a separate perceptual loss network. We show that calibrating this system on VLM judgments produces competitive or state-of-the-art performance on human-aligned visual compression depending on the dataset, according to perceptual metrics and large-scale user studies. We additionally conduct an extensive analysis of the VLM-based reward design and training procedure and share important insights. More visuals are available at https://kylesargent.github.io/vlic
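The preference-based post-training mentioned in the abstract (Diffusion DPO, per the TL;DR) can be illustrated with a minimal sketch. It assumes the commonly cited form of the Diffusion DPO objective: the loss rewards the trained model for denoising the preferred ("winner") reconstruction better than a frozen reference model does, relative to the rejected ("loser") reconstruction. The function name, argument names, and the single `beta` scale (which here absorbs any timestep weighting) are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import Sequence


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def diffusion_dpo_loss(
    err_w_model: Sequence[float],
    err_w_ref: Sequence[float],
    err_l_model: Sequence[float],
    err_l_ref: Sequence[float],
    beta: float = 1.0,  # assumed scale; absorbs any timestep weighting
) -> float:
    """Mean Diffusion DPO-style loss over a batch of preference pairs.

    Each argument holds one squared noise-prediction error per pair:
    err_w_* for the preferred (winner) sample, err_l_* for the rejected
    (loser) sample, measured under the trained model and a frozen reference.
    """
    losses = []
    for wm, wr, lm, lr in zip(err_w_model, err_w_ref, err_l_model, err_l_ref):
        # Negative values mean the trained model denoises better than the reference.
        diff_w = wm - wr
        diff_l = lm - lr
        logit = -beta * (diff_w - diff_l)
        # -log(sigmoid(logit)) == softplus(-logit)
        losses.append(softplus(-logit))
    return sum(losses) / len(losses)
```

When the trained model and the reference agree, the loss equals log 2; it falls below that value as the model improves on the winner relative to the loser, which is the direction a judge's preference pushes the decoder in this sketch.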

Paper Structure

This paper contains 30 sections, 5 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Left. We propose a new post-training technique for diffusion autoencoders which uses Vision-Language Models to judge different decodings of the same image and leverages these judgments to improve the autoencoder through Diffusion DPO. Right. Our method, VLIC, demonstrates substantial improvements in the overall reconstruction quality, as well as better alignment to human perception.
  • Figure 2: Qualitative results on standard image compression datasets. Top: We compare VLIC with HiFiC mentzer2020high and PO-ELIC he2022poelic on a CLIC 2022 image CLIC2022Dataset at various bits per pixel (bpp). Bottom: We compare our approach with HiFiC and PerCo on MS-COCO lin2014microsoft. We find that our approach represents perceptually relevant fine details, faces, and textures more faithfully.
  • Figure 3: Method. An original image is encoded to a one-dimensional discrete latent code via an encoder. The discrete code is entropy coded by an auto-regressive language model. The diffusion decoder samples two reconstructions conditioned on the latent code, which are ranked via a VLM. The resulting preference is used to train the full diffusion autoencoder via Diffusion DPO wallace2023diffusion.
  • Figure 4: Quantitative Evaluation on Image Compression Datasets. Overall, VLIC achieves competitive or state-of-the-art performance. VLIC performs particularly well on perceptual metrics, and particularly well on MS-COCO, which contains a high percentage of images with human-relevant characteristics such as text and faces.
  • Figure 5: Scaling self-ensembling. The VLM becomes more predictive of human judgment on BAPPS kettunen2019elpips as test-time compute (number of VLM seeds) is scaled.
  • ...and 3 more figures
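The ranking step in Figure 3 and the self-ensembling in Figure 5 can be sketched as follows. This is a hedged illustration only: the `judge` callable is a hypothetical stand-in for a VLM query, and aggregating seeds by majority vote is an assumption made here for concreteness; the captions say only that the number of VLM seeds is scaled.

```python
from typing import Callable, Tuple

# Hypothetical judge signature: (original, recon_a, recon_b, seed) -> 0 or 1,
# where 0 means recon_a is judged closer to the original and 1 means recon_b.
Judge = Callable[[str, str, str, int], int]


def self_ensemble(
    judge: Judge,
    original: str,
    recon_a: str,
    recon_b: str,
    num_seeds: int = 5,
) -> Tuple[int, float]:
    """Query the judge once per seed and aggregate by majority vote.

    Returns (winner, agreement): winner is 0 or 1, and agreement is the
    fraction of seeds that voted for the winner.
    """
    if num_seeds < 1 or num_seeds % 2 == 0:
        raise ValueError("num_seeds must be a positive odd number to avoid ties")
    votes = [judge(original, recon_a, recon_b, seed) for seed in range(num_seeds)]
    votes_for_b = sum(votes)
    winner = 1 if 2 * votes_for_b > num_seeds else 0
    agreement = max(votes_for_b, num_seeds - votes_for_b) / num_seeds
    return winner, agreement


def alternating_toy_judge(original: str, recon_a: str, recon_b: str, seed: int) -> int:
    """Toy stand-in for a noisy judge: the vote depends only on the seed parity."""
    return seed % 2


winner, agreement = self_ensemble(
    alternating_toy_judge, "orig.png", "a.png", "b.png", num_seeds=5
)
```

In this sketch the returned winner would serve as the preferred sample in a preference pair for post-training, and the agreement fraction gives a rough signal of how noisy the judgment was.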