
Enhancing Image Aesthetics with Dual-Conditioned Diffusion Models Guided by Multimodal Perception

Xinyu Nan, Ning Wang, Yuyao Zhai, Mei Yang

Abstract

Image aesthetic enhancement aims to perceive aesthetic deficiencies in images and perform corresponding editing operations, which is highly challenging and requires the model to possess creativity and aesthetic perception capabilities. Although recent advancements in image editing models have significantly enhanced their controllability and flexibility, they struggle with enhancing image aesthetics. The primary challenges are twofold: first, following editing instructions with aesthetic perception is difficult, and second, there is a scarcity of "perfectly-paired" images that have consistent content but distinct aesthetic qualities. In this paper, we propose Dual-supervised Image Aesthetic Enhancement (DIAE), a diffusion-based generative model with multimodal aesthetic perception. First, DIAE incorporates Multimodal Aesthetic Perception (MAP) to convert ambiguous aesthetic instructions into explicit guidance by (i) employing detailed, standardized aesthetic instructions across multiple aesthetic attributes, and (ii) utilizing multimodal control signals derived from text-image pairs that maintain consistency within the same aesthetic attribute. Second, to mitigate the lack of "perfectly-paired" images, we collect an "imperfectly-paired" dataset called IIAEData, consisting of images with varying aesthetic qualities that share identical semantics. To better leverage the weak matching characteristics of IIAEData during training, a dual-branch supervision framework is also introduced for weakly supervised image aesthetic enhancement. Experimental results demonstrate that DIAE outperforms the baselines and obtains superior image aesthetic scores and image content consistency scores.

Paper Structure (11 sections, 6 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Examples of image aesthetic enhancement. Given the original image, image content, and aesthetic descriptions (left), DIAE generates results with enhanced aesthetics (right). DIAE is capable of generating images that are content-consistent with the input images while possessing enhanced aesthetics.
  • Figure 2: Comparison between "imperfectly-paired" and "perfectly-paired" [brooks2022instructpix2pix].
  • Figure 3: Overview of DIAE. IIAEData Collection: collecting "imperfectly-paired" data for IIAEData, and obtaining image-pair matching and aesthetic assessment prompts through LLaVA-13b [liu2023llava] and UNIAA-LLaVA [zhou2024uniaa]. Multimodal Aesthetic Perception (MAP): multimodal aesthetic perception using textual descriptions together with HSV and contour maps for image color and image structure. Model Optimization: a weakly supervised diffusion model training strategy with "imperfectly-paired" input and reference images, using MAP through ControlNet [zhang2023adding].
  • Figure 4: Guiding DIAE via MAP. (a) Generation of the conditioning embeddings $cond$. (b) Using $cond$ to guide diffusion models via ControlNet [zhang2023adding].
  • Figure 5: Examples of generated results from the low-aesthetic-quality batch (MOS $<4.0$). The examples have a resolution of $512\times512$, and the Mean Aesthetic Score (MAS) below is the LAIONs-based score calculated on the batch.
  • ...and 2 more figures
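
The Figure 3 caption names HSV and contour maps as the MAP control signals for image color and image structure. As a rough illustration only, the following Python sketch derives both kinds of map from a small RGB image using just the standard library. The function names `hsv_map`, `contour_map`, and `make_test_image`, the use of a Sobel operator, and the threshold value are all assumptions made for this example; the captions above do not say how DIAE actually extracts its maps.

```python
import colorsys


def hsv_map(rgb):
    """Convert an H x W grid of (r, g, b) tuples in [0, 255]
    to a grid of (h, s, v) tuples with each channel in [0, 1]."""
    return [
        [colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0) for (r, g, b) in row]
        for row in rgb
    ]


def contour_map(rgb, threshold=0.25):
    """Binary edge map from the Sobel gradient magnitude of the luminance.

    Hypothetical stand-in for a contour extractor; border pixels stay 0.
    """
    height, width = len(rgb), len(rgb[0])
    # Luminance in [0, 1] using the common Rec. 601 weights.
    gray = [
        [(0.299 * r + 0.587 * g + 0.114 * b) / 255.0 for (r, g, b) in row]
        for row in rgb
    ]
    edges = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Horizontal and vertical Sobel responses.
            gx = (gray[y - 1][x + 1] + 2 * gray[y][x + 1] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y][x - 1] - gray[y + 1][x - 1])
            gy = (gray[y + 1][x - 1] + 2 * gray[y + 1][x] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y - 1][x] - gray[y - 1][x + 1])
            if (gx * gx + gy * gy) ** 0.5 > threshold:
                edges[y][x] = 1
    return edges


def make_test_image(size=6):
    """Toy image: left half black, right half pure red."""
    return [
        [(0, 0, 0) if x < size // 2 else (255, 0, 0) for x in range(size)]
        for _ in range(size)
    ]


if __name__ == "__main__":
    image = make_test_image()
    color_signal = hsv_map(image)          # color condition (per-pixel HSV)
    structure_signal = contour_map(image)  # structure condition (binary edges)
    print(color_signal[0][0], color_signal[0][5])
    for row in structure_signal:
        print(row)
```

On the toy image, the edge map marks only the vertical boundary between the black and red halves, while the HSV map keeps the color information that the edge map discards. In a ControlNet-style setup, maps like these would be encoded as spatial conditions alongside the text prompt; how DIAE combines them is described in the paper itself, not in this sketch.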