Alfie: Democratising RGBA Image Generation With No $$$

Fabio Quattrini; Vittorio Pippi; Silvia Cascianelli; Rita Cucchiara

TL;DR

This paper tackles the challenge of generating high-quality RGBA illustrations with an accurate alpha channel and full-subject containment for design contexts, without additional training costs. It introduces Alfie, a fully automated, prompt-guided pipeline that repurposes a pre-trained Diffusion Transformer (PixArt-$\Sigma$) through inference-time adaptations: subject-centering via a foreground/background latent split and mask-guided blending, and alpha-channel estimation derived from cross- and self-attention maps, followed by foreground cleanup with GrabCut. The authors evaluate Alfie on containment metrics, CLIP-based prompt fidelity, and user preference, reporting strong containment (>95%), CLIP-S near reference, and a 63% user preference over matting, as well as demonstrating compositional scene generation with Collage Diffusion. The results show Alfie can produce ready-to-use RGBA illustrations with minimal cost and effort, enabling straightforward integration into visual designs and automated scene composition pipelines. The work also provides code and an evaluation setup to foster further research into low-cost, inference-time adaptations for RGBA image generation.
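As a rough illustration of the alpha-estimation step described above, the following Python sketch combines a cross-attention map with a self-attention map into a soft alpha matte and then checks whether the subject stays inside the frame. The combination rule (element-wise product of min-max-normalized maps), the border-based containment test, the 0.5 threshold, and all function names are assumptions made for illustration; this summary does not specify them, and the GrabCut cleanup is omitted.

```python
from typing import List

Map = List[List[float]]


def min_max_normalize(m: Map) -> Map:
    """Rescale a 2D map to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in m for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in m]
    return [[(v - lo) / (hi - lo) for v in row] for row in m]


def estimate_alpha(cross_attn: Map, self_attn: Map) -> Map:
    """Assumed rule: element-wise product of the two normalized maps."""
    c = min_max_normalize(cross_attn)
    s = min_max_normalize(self_attn)
    product = [[cv * sv for cv, sv in zip(cr, sr)] for cr, sr in zip(c, s)]
    return min_max_normalize(product)


def is_contained(alpha: Map, threshold: float = 0.5) -> bool:
    """Assumed containment test: no border pixel counts as foreground."""
    h, w = len(alpha), len(alpha[0])
    border = (
        alpha[0]
        + alpha[-1]
        + [alpha[y][0] for y in range(h)]
        + [alpha[y][w - 1] for y in range(h)]
    )
    return all(v < threshold for v in border)
```

With toy 3x3 maps that peak at the center, `estimate_alpha` yields an alpha of 1.0 at the center and `is_contained` returns `True`; a map that peaks at a corner fails the containment check, which is the kind of cropped subject the pipeline aims to avoid.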

Abstract

Designs and artworks are ubiquitous across various creative fields, requiring graphic design skills and dedicated software to create compositions that include many graphical elements, such as logos, icons, symbols, and art scenes, which are integral to visual storytelling. Automating the generation of such visual elements improves graphic designers' productivity, democratizes and innovates the creative industry, and helps generate more realistic synthetic data for related tasks. These illustration elements are mostly RGBA images with irregular shapes and cutouts, facilitating blending and scene composition. However, most image generation models are incapable of generating such images, and achieving this capability requires expensive computational resources, specific training recipes, or post-processing solutions. In this work, we propose a fully-automated approach for obtaining RGBA illustrations by modifying the inference-time behavior of a pre-trained Diffusion Transformer model, exploiting the prompt-guided controllability and visual quality offered by such models with no additional computational cost. We force the generation of entire subjects without sharp cropping, whose background is easily removed for seamless integration into design projects or artistic scenes. We show with a user study that, in most cases, users prefer our solution over generating and then matting an image, and we show that our generated illustrations yield good results when used as inputs for composite scene generation pipelines. We release the code at https://github.com/aimagelab/Alfie.
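To make concrete why an accurate alpha channel facilitates blending, here is a minimal sketch of the standard "over" compositing rule, out = alpha * foreground + (1 - alpha) * background, applied to a single 8-bit pixel. The function name and the single-pixel interface are hypothetical and chosen for illustration only; they are not taken from the released code.

```python
from typing import Tuple

RGB = Tuple[int, int, int]
RGBA = Tuple[int, int, int, int]


def composite_over(fg: RGBA, bg: RGB) -> RGB:
    """Blend one 8-bit RGBA pixel over an opaque 8-bit RGB pixel."""
    a = fg[3] / 255.0  # map the 8-bit alpha to [0, 1]
    # Per-channel linear interpolation between background and foreground.
    return tuple(round(a * f + (1.0 - a) * b) for f, b in zip(fg[:3], bg))
```

A fully opaque foreground pixel replaces the background, a fully transparent one leaves it untouched, and intermediate alpha values along the subject's border give the soft transition that lets an illustration sit cleanly in a composed scene.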

Paper Structure (7 sections, 7 equations, 7 figures, 1 table)

Introduction
Related Work
Preliminaries
Inference-Time Illustration Generation
Experiments and Results
Results
Conclusions

Figures (7)

Figure 1: We propose a fully-automated pipeline to generate RGBA illustrations by adapting the inference-time behavior of a Diffusion Transformer model.
Figure 2: Schematic representation of our fully-automated prompt-guided pipeline to obtain RGBA illustrations. The core element is a diffusion model, for which we devise an inference-time adaptation strategy aimed at making the generated images illustration-like. Then, we process the generation attention maps to estimate the $\alpha$ channel.
Figure 3: Cross-attention map analysis of the prompt "A photo of a bullmastiff with a jacket". Left to right: maps of the three prompt nouns, candidate region mask computed with and without the generic noun "photo", and foreground extraction results.
Figure 4: Qualitative comparison of our constrained whole-subject generation against the use of meta-descriptions in the input prompt.
Figure 5: Qualitative comparison of different $\alpha$ estimates. Combining self- and cross-attention maps provides the best balance of spatial localization and transparency values, and their cleanup using GrabCut (Rother et al., 2004), which yields Alfie, further increases border precision.
...and 2 more figures
