Audio-Infused Automatic Image Colorization by Exploiting Audio Scene Semantics

Pengcheng Zhao, Yanxiang Chen, Yang Zhao, Zhao Zhang

TL;DR

Automatic image colorization remains ill-posed without strong semantic cues; this work introduces Audio-Infused Automatic Image Colorization (AIAIC), a three-stage, self-supervised framework that leverages audio scene semantics to guide color prediction. A colorization backbone is first trained with scene semantics from color images, then audio scene semantics are learned to align with visual semantics, and finally the audio representations are injected into the pretrained colorization network. The method employs an AdaIN-based semantic guidance module and a Dynamic Semantic Guidance (DSG) mechanism with a learnable relevance score to handle audio-visual misalignment and missing audio, and it is trained without manual audio labels. Experiments on VGGSound and AVE show that incorporating audio improves colorization metrics and color plausibility, including under unknown audio conditions, demonstrating the practicality of multimodal guidance for automatic colorization.
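
The AdaIN-based guidance and relevance gating mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: it shows only the standard adaptive instance normalization step (normalize each feature channel, then rescale and shift it), followed by a hypothetical blend controlled by a relevance score in [0, 1]. The function names `adain` and `gate`, the flat-list feature layout, and the linear blending rule are all assumptions made for illustration. In a real model the scales and shifts would presumably be predicted from the image or audio semantic embedding.

```python
from statistics import fmean, pstdev


def adain(channels, scales, shifts, eps=1e-5):
    """Adaptive instance normalization over a list of feature channels.

    channels: list of channels, each a flat list of float activations.
    scales, shifts: one value per channel (assumed here to come from a
    semantic embedding; supplied directly to keep the sketch small).
    """
    out = []
    for values, scale, shift in zip(channels, scales, shifts):
        mean = fmean(values)
        std = pstdev(values)
        # Normalize the channel, then apply the semantic scale and shift.
        out.append([scale * (v - mean) / (std + eps) + shift for v in values])
    return out


def gate(original, modulated, relevance):
    """Hypothetical relevance gating: blend modulated and original features.

    relevance = 1.0 keeps the full semantic guidance; relevance = 0.0
    falls back to the unguided features (e.g. irrelevant or missing audio).
    """
    return [
        [relevance * m + (1.0 - relevance) * o for o, m in zip(o_ch, m_ch)]
        for o_ch, m_ch in zip(original, modulated)
    ]


if __name__ == "__main__":
    features = [[0.0, 1.0, 2.0, 3.0], [10.0, 10.5, 11.0, 12.5]]
    guided = adain(features, scales=[2.0, 0.5], shifts=[3.0, -1.0])
    print(gate(features, guided, relevance=0.7))
```

With a low relevance score the output stays close to the unguided features, which is one simple way to keep mismatched or absent audio from distorting the colorization.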

Abstract

Automatic image colorization is inherently an ill-posed problem with uncertainty, which requires an accurate semantic understanding of scenes to estimate reasonable colors for grayscale images. Although recent interaction-based methods have achieved impressive performance, inferring realistic and accurate colors automatically remains very difficult. To ease the semantic understanding of grayscale scenes, this paper utilizes the corresponding audio, which naturally contains extra semantic information about the same scene. Specifically, a novel and pluggable audio-infused automatic image colorization (AIAIC) method is proposed, which consists of three stages. First, we take color image semantics as a bridge and pretrain a colorization network guided by color image semantics. Second, the natural co-occurrence of audio and video is utilized to learn the color semantic correlations between audio and visual scenes. Third, the implicit audio semantic representation is fed into the pretrained network to realize audio-guided colorization. The whole process is trained in a self-supervised manner without human annotation. Experiments demonstrate that audio guidance can effectively improve the performance of automatic colorization, especially for scenes that are difficult to understand from the visual modality alone.

Paper Structure (16 sections, 11 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Comparisons with existing methods [38, 36, 29], demonstrating that audio can improve the semantic accuracy of the generated colors so that the overall result matches the real scene.
  • Figure 2: The framework of our proposed method for audio-infused automatic image colorization (AIAIC), which is composed of three steps.
  • Figure 3: The framework of the designed relevance network (RNet).
  • Figure 4: Visual comparisons with the baselines. Our proposed AIAIC method can generate colors that better conform to the actual scene, e.g., sea waves and diving (second row), while enhancing the colors of the subjects in the scene, such as the flame (fourth row) and the lion (last row).
  • Figure 5: Qualitative comparisons demonstrating that incorporating audio and the multi-step training strategy effectively complements and enhances the visual model's understanding of scene semantics, leading to more accurate colors.