Abstract Art Interpretation Using ControlNet

Rishabh Srivastava; Addrish Roy

TL;DR

The paper presents a geometry-guided extension of ControlNet for text-to-image diffusion, aimed at finer spatial control in abstract-art interpretation. Using a triangle-based conditioning signal and a WIT-derived dataset captioned with BLIP, the authors report that object locations are preserved while prompts still allow flexible interpretation; color fidelity remains a limitation. Key contributions include the dataset construction (14,279 pairs with 50-triangle priors), a detailed ControlNet integration scheme with zero convolutions, and empirical observations of training dynamics such as the sudden convergence phenomenon. The work advances controllable diffusion for abstract-art applications and outlines paths toward greater geometric diversity and quantitative evaluation.

Abstract

Our study explores the combination of abstract art interpretation and text-to-image synthesis, addressing the difficulty of achieving precise spatial control over image composition through textual prompts alone. Leveraging ControlNet, we give users finer control over the synthesis process and therefore over the composition of the synthesized imagery. Inspired by the minimalist forms found in abstract artworks, we introduce a novel condition crafted from geometric primitives such as triangles.
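To make the idea of a condition built from triangles concrete, the following is a minimal sketch of rasterizing a list of colored triangles into a conditioning image. It is an illustration only: the function names (`render_triangles`, `random_triangles`), the canvas size, and the use of random triangles are assumptions made here, and the procedure the authors use to derive triangles from a source image is not reproduced.

```python
import math
import random


def edge(a, b, p):
    """Signed area of (a, b, p); its sign tells which side of edge a->b the point p is on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def point_in_triangle(tri, p):
    """True if p lies inside or on the border of tri, for either vertex winding."""
    a, b, c = tri
    signs = (edge(a, b, p), edge(b, c, p), edge(c, a, p))
    return not (any(s < 0 for s in signs) and any(s > 0 for s in signs))


def render_triangles(triangles, width, height, background=(255, 255, 255)):
    """Paint (triangle, rgb) pairs in order onto a height x width grid of RGB tuples."""
    canvas = [[background] * width for _ in range(height)]
    for tri, color in triangles:
        xs = [v[0] for v in tri]
        ys = [v[1] for v in tri]
        # Only visit pixels in the triangle's bounding box, clipped to the canvas.
        x_lo, x_hi = max(math.floor(min(xs)), 0), min(math.ceil(max(xs)), width)
        y_lo, y_hi = max(math.floor(min(ys)), 0), min(math.ceil(max(ys)), height)
        for y in range(y_lo, y_hi):
            for x in range(x_lo, x_hi):
                # A pixel is covered when its center falls inside the triangle.
                if point_in_triangle(tri, (x + 0.5, y + 0.5)):
                    canvas[y][x] = color
    return canvas


def random_triangles(n, width, height, seed=0):
    """Hypothetical stand-in for a fitted triangle set: n random colored triangles."""
    rng = random.Random(seed)

    def vertex():
        return (rng.uniform(0, width), rng.uniform(0, height))

    def color():
        return tuple(rng.randrange(256) for _ in range(3))

    return [((vertex(), vertex(), vertex()), color()) for _ in range(n)]


# Example: a 64 x 64 condition image built from 50 triangles (sizes chosen for illustration).
condition = render_triangles(random_triangles(50, 64, 64), 64, 64)
```

Later triangles overwrite earlier ones, so the order of the list acts as a simple painter's algorithm. In practice an imaging library would replace the pure-Python loop.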

Paper Structure (10 sections, 3 equations, 4 figures)

Introduction
Related Work
Methodology
Dataset Preparation
ControlNet Architecture
Training
Inference
Experimental Results
Discussion
Conclusion

Figures (4)

Figure 1: Representation of our dataset generation pipeline.
Figure 2: To add a ControlNet to the block shown in (a), the original block is first locked and a trainable copy of it is created. The two blocks are then connected by zero convolution layers, i.e., $1 \times 1$ convolutions whose weights and biases are initialized to 0. The conditioning vector $c$ represents the additional information to be integrated into the network, as depicted in (b).
Figure 3: Sudden convergence phenomenon: ControlNet produces high-quality images throughout training, but at a certain point (e.g., the bolded step 7361) the output abruptly begins to overlap with the control image. Although the overlap between the generated image and the target image increases at step 7361, the colors do not match completely.
Figure 4: An example of how the same abstract image can be interpreted differently.
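
The wiring described in the Figure 2 caption can be sketched in a few lines. This is a toy, framework-free illustration rather than the authors' implementation: feature maps are nested lists shaped `[H][W][C]`, the condition `c` is assumed to already have the same shape as `x`, and the names `conv1x1`, `zero_conv_params`, and `controlled_block` are hypothetical. It shows why zero convolutions leave the locked block's output unchanged at the start of training.

```python
def conv1x1(feature_map, weight, bias):
    """1x1 convolution: a per-pixel linear map across channels.

    feature_map is [H][W][C_in], weight is [C_out][C_in], bias is [C_out].
    """
    return [
        [
            [sum(w * v for w, v in zip(w_row, pixel)) + b
             for w_row, b in zip(weight, bias)]
            for pixel in row
        ]
        for row in feature_map
    ]


def zero_conv_params(c_in, c_out):
    """Weights and biases of a zero convolution at initialization (all zeros)."""
    return [[0.0] * c_in for _ in range(c_out)], [0.0] * c_out


def add(a, b):
    """Element-wise sum of two feature maps of identical shape."""
    return [
        [[u + v for u, v in zip(pa, pb)] for pa, pb in zip(ra, rb)]
        for ra, rb in zip(a, b)
    ]


def controlled_block(x, c, locked_block, trainable_copy, zero_in, zero_out):
    """Locked block plus a trainable copy, linked by two zero convolutions."""
    # The condition enters the trainable copy through the first zero convolution.
    h = add(x, conv1x1(c, *zero_in))
    # The copy's output is added back through the second zero convolution.
    return add(locked_block(x), conv1x1(trainable_copy(h), *zero_out))


# Toy stand-in for a network block: doubles every activation.
def toy_block(f):
    return [[[2.0 * v for v in px] for px in row] for row in f]


x = [[[1.0, 2.0], [3.0, 4.0]]]   # 1 x 2 feature map with 2 channels
c = [[[5.0, 6.0], [7.0, 8.0]]]   # condition, assumed to match x in shape
y = controlled_block(x, c, toy_block, toy_block,
                     zero_conv_params(2, 2), zero_conv_params(2, 2))
```

With both zero convolutions still at their initial values, `y` equals `toy_block(x)`, so the condition has no effect until training moves those weights away from zero.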
