Compose Your Aesthetics: Empowering Text-to-Image Models with the Principles of Art

Zhe Jin, Tat-Seng Chua

TL;DR

This work tackles the limitation of universal aesthetics in text-to-image diffusion by introducing aesthetic alignment based on the Principles of Art (PoA). It builds CompArt, a large WikiArt-derived dataset with extensive PoA annotations and captions, annotated by a multimodal LLM to enable robust, user-specified aesthetic controls. The authors propose ArtDapter, a lightweight, transferable adapter that injects PoA-based conditions into latent diffusion models through cross-attention, enabling 10 compositional controls guided by PoA without modifying the base model. An evaluation framework combines GPT-4o annotations and ImageReward scores to assess PoA alignment, showing that ArtDapter can effectively honor PoA conditions and outperform baselines in principle-level alignment, with clear demonstrations of multi-PoA composition. The work highlights a pathway toward personalized, composition-aware generative tools and provides public datasets and code to spur further research in aesthetically guided T2I generation.

Abstract

Text-to-Image (T2I) diffusion models (DMs) have garnered widespread adoption due to their ability to generate high-fidelity outputs and their accessibility to anyone able to put imagination into words. However, DMs are often predisposed to generate unappealing outputs, much like the random images on the internet they were trained on. Existing approaches to address this are founded on the implicit premise that visual aesthetics is universal, which is limiting. Aesthetics in the T2I context should be about personalization, and we propose the novel task of aesthetics alignment, which seeks to align the T2I generation output with user-specified aesthetics. Inspired by how artworks provide an invaluable perspective from which to approach aesthetics, we codify visual aesthetics using the compositional framework artists employ, known as the Principles of Art (PoA). To facilitate this study, we introduce CompArt, a large-scale compositional art dataset built on top of WikiArt with PoA analysis annotated by a capable Multimodal LLM. Leveraging the expressive power of LLMs and training a lightweight and transferable adapter, we demonstrate that T2I DMs can effectively offer 10 compositional controls through user-specified PoA conditions. Additionally, we design an appropriate evaluation framework to assess the efficacy of our approach.

Paper Structure

This paper contains 19 sections, 1 equation, 15 figures, 2 tables.

Figures (15)

  • Figure 1: Samples generated with our ArtDapter model, showcasing its ability to adhere to the respective context, art-style and compositional conditions (bottom row) across different Principles of Art (PoA). An extended version of this figure covering all 10 PoA is included in the Appendix.
  • Figure 2: CompArt annotation examples. An example is given for every type of annotation in the dataset, namely the artwork caption and the 10 principles of art. For the principle of balance, an example is presented for each sense of it (i.e. asymmetric, symmetric, radial).
  • Figure 3: Principle-wise breakdown of the 637,573 PoA annotations in CompArt. For the annotations on each principle, the proportions of prominence levels (Weak, Mild, Moderate, Strong) are indicated by the respective colored partitions within the bar.
  • Figure 4: Example scorecard illustrating our evaluation scheme. The left column details the artistic conditions (PoA conditions truncated) for the associated CompArt image and for the outputs generated by ArtDapter, ELLA and SDv1.5. For each image, we score its alignment to each PoA condition using GPT-4o (GPT) and ImageReward (IR; Xu et al., NeurIPS 2023).
  • Figure 5: Principle and image level evaluation results in terms of winning percentages of each model. For each level of evaluation, the results for $\alpha$ and $\beta$ assessments are respectively reported in its top and bottom subplots.
  • ...and 10 more figures
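
The captions for Figures 4 and 5 describe per-condition alignment scores that are aggregated into winning percentages for each model. The following is a minimal sketch of one way such an aggregation could be computed, not the authors' implementation. The scorecard layout, the function name winning_percentages, the rule that ties count as no win, the placeholder key "other_principle" and all score values are assumptions made purely for illustration.

```python
from collections import defaultdict


def winning_percentages(scorecards):
    """Aggregate per-principle scores into winning percentages.

    scorecards: list of dicts mapping principle -> {model: score}.
    Returns a dict mapping principle -> {model: percentage of scorecards
    in which that model holds the strictly highest score}.
    Assumption: a tie for the top score counts as a win for nobody.
    """
    wins = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for card in scorecards:
        for principle, scores in card.items():
            totals[principle] += 1
            for model in scores:
                wins[principle][model] += 0  # register every model seen
            best = max(scores.values())
            winners = [m for m, s in scores.items() if s == best]
            if len(winners) == 1:
                wins[principle][winners[0]] += 1
    return {
        p: {m: 100.0 * w / totals[p] for m, w in models.items()}
        for p, models in wins.items()
    }


# Hypothetical scores, invented for this example only.
cards = [
    {
        "balance": {"ArtDapter": 4, "ELLA": 3, "SDv1.5": 2},
        "other_principle": {"ArtDapter": 3, "ELLA": 3, "SDv1.5": 1},
    },
    {"balance": {"ArtDapter": 2, "ELLA": 4, "SDv1.5": 1}},
]
result = winning_percentages(cards)
print(result)
```

In this toy input, ArtDapter and ELLA each win one of the two "balance" comparisons, and the tie on "other_principle" yields no winner under the assumed tie rule. How the paper handles ties and how it defines an image-level win are not stated in this excerpt.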