Guided Diffusion for the Extension of Machine Vision to Human Visual Perception

Takahiro Shindo, Yui Tatsumi, Taiju Watanabe, Hiroshi Watanabe

TL;DR

The paper tackles the challenge of producing image representations usable by both machines and humans without increasing bitrate. It proposes a guided-diffusion approach in which Stable Diffusion, conditioned via ControlNet on the image decoded by the machine-oriented SA-ICM method, generates images for human perception from random noise; a Color Controller further aligns colors with the machine decoding. The training objective is $\mathcal{L}=\mathbb{E}_{z_{0},\mathbf{t},\mathbf{c_{t}},\mathbf{c_{f}},\epsilon\sim\mathcal{N}(0,1)}\left[\lVert \epsilon-\epsilon_{\theta}(z_{t},\mathbf{t},\mathbf{c_{t}},\mathbf{c_{f}}) \rVert_{2}^{2}\right]$, that is, the network $\epsilon_{\theta}$ is trained to predict the noise added during the diffusion process. Empirically, the method achieves better perceptual metrics (FID, KID) and texture restoration for human viewers while leaving the machine-side bitrate unchanged, offering a practical bridge between machine vision and human perception and suggesting applicability to other ICM techniques.
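
The objective above can be illustrated with a small Monte Carlo estimate. The following is a minimal sketch, not the authors' implementation: it assumes the standard DDPM-style forward process $z_{t}=\sqrt{\bar{\alpha}_{t}}\,z_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\epsilon$, which this summary does not state explicitly, and it treats $\mathbf{c_{t}}$ and $\mathbf{c_{f}}$ as opaque conditioning inputs (presumably a text condition and the ControlNet condition, though the summary does not define them). The names `add_noise`, `estimate_loss`, `zero_denoiser`, and the schedule `alpha_bars` are hypothetical.

```python
import math
import random


def add_noise(z0, eps, alpha_bar_t):
    """Assumed forward process: z_t = sqrt(a) * z0 + sqrt(1 - a) * eps."""
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [signal * z + noise * e for z, e in zip(z0, eps)]


def squared_l2(eps, eps_pred):
    """Squared L2 norm ||eps - eps_pred||_2^2 for one sample."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))


def estimate_loss(denoiser, latents, c_t, c_f, alpha_bars, rng):
    """Monte Carlo estimate of the expectation over z_0, t, and eps."""
    total = 0.0
    for z0 in latents:
        t = rng.randrange(len(alpha_bars))        # sample a timestep
        eps = [rng.gauss(0.0, 1.0) for _ in z0]   # eps ~ N(0, 1)
        z_t = add_noise(z0, eps, alpha_bars[t])   # noised latent
        eps_pred = denoiser(z_t, t, c_t, c_f)     # eps_theta(z_t, t, c_t, c_f)
        total += squared_l2(eps, eps_pred)
    return total / len(latents)


def zero_denoiser(z_t, t, c_t, c_f):
    """Hypothetical stand-in for eps_theta that always predicts zero noise."""
    return [0.0] * len(z_t)


if __name__ == "__main__":
    rng = random.Random(0)
    latents = [[0.5, -0.5, 1.0, 0.0] for _ in range(8)]  # toy flattened latents
    alpha_bars = [0.9, 0.5, 0.1]                         # toy noise schedule
    print(estimate_loss(zero_denoiser, latents, None, None, alpha_bars, rng))
```

In a real training loop `denoiser` would be the conditioned network and the loss would be minimized by gradient descent; the zero predictor here only shows how the terms of the expectation fit together.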

Abstract

Image compression technology eliminates redundant information to enable efficient transmission and storage of images, serving both machine vision and human visual perception. For years, image coding for human perception has been well studied, leading to the development of various image compression standards. Meanwhile, with the rapid advancement of image recognition models, image compression for AI tasks, known as Image Coding for Machines (ICM), has gained significant importance. Scalable image coding techniques that address the needs of both machines and humans have therefore become a key area of interest. There is also increasing demand for research that applies diffusion models, which can generate human-viewable images from a small amount of data, to image compression for human vision. Image compression methods that use diffusion models can partially reconstruct the target image by guiding the generation process with a small amount of conditioning information. Inspired by the diffusion model's potential, we propose a method for extending machine vision to human visual perception using guided diffusion. Using a diffusion model guided by the output of the ICM method, we generate images for human perception from random noise. Guided diffusion acts as a bridge between machine vision and human vision, enabling transitions between them without any additional bitrate overhead. The generated images are then evaluated in terms of bitrate and image quality, and we compare their compression performance with that of other scalable image coding methods for humans and machines.

Paper Structure

This paper contains 11 sections, 1 equation, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Image processing in scalable image compression methods: (a) ICMH-Net, (b) ICMH-FF, (c) Ours.
  • Figure 2: Image processing flow in the proposed method. The ICM method serves as a condition to generate images optimized for human perception. The generated image is then input into the CC-module, which restores color elements approximating the original image.
  • Figure 3: Examples of original and decoded images: (a) Original image, (b) Decoded image for machine vision using SA-ICM, (c) Decoded image for human vision using ICMH-FF, (d) Decoded image for human perception using the proposed method.
  • Figure 4: Image compression performance of the proposed and comparative methods. Five metrics are used to evaluate image quality for human visual perception: (a) PSNR($\uparrow$), (b) SSIM($\uparrow$), (c) LPIPS($\downarrow$), (d) FID($\downarrow$), and (e) KID($\downarrow$).
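
Of the five metrics in Figure 4, PSNR is the only one with a simple closed form: $\mathrm{PSNR}=10\log_{10}\left(\mathrm{MAX}^{2}/\mathrm{MSE}\right)$. The sketch below is a minimal illustration of that standard definition, assuming 8-bit pixel values flattened into a list; the function name `psnr` and its parameters are hypothetical and are not taken from the paper's evaluation code.

```python
import math


def psnr(reference, decoded, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(decoded) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - d) ** 2 for r, d in zip(reference, decoded)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    original = [100, 100, 100, 100]   # toy flattened image
    degraded = [110, 90, 110, 90]     # every pixel off by 10
    print(psnr(original, degraded))
```

Higher PSNR means a smaller pixel-wise error. LPIPS, FID, and KID instead compare deep-network features, so they require pretrained models and are not sketched here.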