Multimodal Semantic-Aware Automatic Colorization with Diffusion Prior

Han Wang, Xinning Chai, Yiwen Wang, Yuhong Zhang, Rong Xie, Li Song

TL;DR

The paper tackles semantic color errors and desaturation in automatic colorization. It introduces a diffusion-prior framework that conditions in latent space on grayscale input and multimodal high-level semantics, combined with a luminance-aware decoder to preserve details. Key contributions include latent-space diffusion with pixel-level grayscale guidance, a multimodal semantic guidance module leveraging category, caption, and segmentation priors, and a luminance-aware reconstruction path that improves perceptual realism. Experimental results show superior perceptual quality and higher human preference over previous state-of-the-art methods, demonstrating the practical efficacy of diffusion priors for conditional colorization.

Abstract

Colorizing grayscale images offers an engaging visual experience. Existing automatic colorization methods often fail to generate satisfactory results because they produce semantically incorrect or unsaturated colors. In this work, we propose an automatic colorization pipeline to overcome these challenges. We leverage the extraordinary generative ability of the diffusion prior to synthesize color with plausible semantics. To overcome the artifacts introduced by the diffusion prior, we apply luminance conditional guidance. Moreover, we adopt multimodal high-level semantic priors to help the model understand the image content and deliver saturated colors. In addition, a luminance-aware decoder is designed to restore details and enhance overall visual quality. The proposed pipeline synthesizes saturated colors while maintaining plausible semantics. Experiments indicate that our proposed method accounts for both diversity and fidelity, surpassing previous methods in perceptual realism and gaining the highest human preference.
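To make the role of luminance concrete, the following is a minimal sketch of why tying the output to the input's luminance preserves detail. It is not the paper's luminance-aware decoder, which is a learned module; it is a fixed color-space stand-in that keeps the chroma of a colorized image and re-imposes the grayscale input as the luma channel. The function names, the choice of BT.601 YCbCr, and the [0, 1] value range are all assumptions made for illustration.

```python
import numpy as np

# BT.601 luma weights (assumed color space for this sketch).
_KR, _KG, _KB = 0.299, 0.587, 0.114

def rgb_to_ycbcr(rgb):
    """Convert HxWx3 RGB in [0, 1] to luma and zero-centered chroma."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = _KR * r + _KG * g + _KB * b
    cb = (b - y) / (2.0 * (1.0 - _KB))
    cr = (r - y) / (2.0 * (1.0 - _KR))
    return np.stack([y, cb, cr], axis=-1)

def ycbcr_to_rgb(ycc):
    """Exact inverse of rgb_to_ycbcr."""
    y, cb, cr = ycc[..., 0], ycc[..., 1], ycc[..., 2]
    r = y + 2.0 * (1.0 - _KR) * cr
    b = y + 2.0 * (1.0 - _KB) * cb
    g = (y - _KR * r - _KB * b) / _KG
    return np.stack([r, g, b], axis=-1)

def reimpose_luminance(gray, colorized_rgb):
    """Hypothetical helper: keep the chroma of colorized_rgb (HxWx3),
    but replace its luma with the grayscale input gray (HxW)."""
    ycc = rgb_to_ycbcr(colorized_rgb)
    ycc[..., 0] = gray
    # Clipping can slightly alter luma for out-of-gamut pixels.
    return np.clip(ycbcr_to_rgb(ycc), 0.0, 1.0)
```

Under these assumptions, any structure the generative model blurred or hallucinated in the luma channel is overwritten by the original grayscale detail, while the synthesized colors are retained. A learned decoder, as described in the abstract, can go further by also correcting chroma artifacts.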

Paper Structure (13 sections, 7 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: We achieve saturated and semantically plausible colorization for grayscale images, surpassing GAN-based (BigColor kim2022bigcolor), transformer-based (CT$^2$), and diffusion-based (ControlNet zhang2023adding) methods.
  • Figure 2: Overview of the proposed automatic colorization pipeline. It combines a semantic prior generator (blue box), a high-level semantic guided diffusion model (yellow box), and a luminance-aware decoder (orange box).
  • Figure 3: Qualitative comparisons among InstColor su2020instance, ChromaGAN vitoria2020chromagan, BigColor kim2022bigcolor, ColTran kumar2021colorization, CT$^2$, ControlNet zhang2023adding, and Ours. More results are provided at https://servuskk.github.io/ColorDiff-Image/.
  • Figure 4: User evaluations.
  • Figure 5: Visual comparison from ablation studies.