COLE: A Hierarchical Generation Framework for Multi-Layered and Editable Graphic Design

Peidong Jia; Chenxuan Li; Yuhui Yuan; Zeyu Liu; Yichao Shen; Bohan Chen; Xingru Chen; Yinglin Zheng; Dong Chen; Ji Li; Xiaodong Xie; Shanghang Zhang; Baining Guo

TL;DR

COLE tackles the challenge of turning simple user intents into editable, multi-layered graphic designs by introducing a hierarchical pipeline that distributes design tasks across specialized LLMs, diffusion models, and multimodal modules. The framework decomposes the problem into intention-to-JSON planning, background and object layer generation, typography reasoning, and a layer-editable SVG renderer, all guided by Reflect- and Quality-focused modules. It also introduces DesignerIntention, a benchmark to evaluate design-intent fidelity and aesthetics, and demonstrates competitive performance against DALL·E3 and CanvaGPT while preserving editability. Collectively, COLE advances reliable, design-aware graphic design generation and provides a practical, editable end product for designers and non-designers alike.

Abstract

Graphic design, which has been evolving since the 15th century, plays a crucial role in advertising. The creation of high-quality designs demands design-oriented planning, reasoning, and layer-wise generation. Unlike the recent CanvaGPT, which integrates GPT-4 with existing design templates to build a custom GPT, this paper introduces the COLE system, a hierarchical generation framework designed to comprehensively address these challenges. The COLE system can transform a vague intention prompt, such as "design a poster for Hisaishi's concert," into a high-quality multi-layered graphic design, while also supporting flexible editing based on user input. The key insight is to dissect the complex task of text-to-design generation into a hierarchy of simpler sub-tasks, each addressed by specialized models working collaboratively. The results from these models are then consolidated to produce a cohesive final output. Our hierarchical task decomposition can streamline the complex process and significantly enhance generation reliability. Our COLE system comprises multiple fine-tuned Large Language Models (LLMs), Large Multimodal Models (LMMs), and Diffusion Models (DMs), each specifically tailored for design-aware layer-wise captioning, layout planning, reasoning, or image and text generation. Furthermore, we construct the DesignerIntention benchmark to demonstrate the superiority of our COLE system over existing methods in generating high-quality graphic designs from user intent. Finally, we present a Canva-like multi-layered image editing tool to support flexible editing of the generated multi-layered graphic design images. We regard our COLE system as an important step towards addressing more complex and multi-layered graphic design generation tasks in the future.
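The hierarchical decomposition described in the abstract can be pictured as a chain of stages that each enrich a shared design state. The sketch below is only an illustration under stated assumptions: the names `DesignState`, `design_llm`, `background_model`, `object_model`, `typography_lmm`, and `run_pipeline` are hypothetical, and every stage is a stub that returns placeholder data rather than calling a real LLM, LMM, or diffusion model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DesignState:
    """Accumulates stage outputs. All field names are illustrative assumptions."""
    intention: str
    plan: Dict[str, str] = field(default_factory=dict)
    layers: List[Dict[str, str]] = field(default_factory=list)


Stage = Callable[[DesignState], DesignState]


def design_llm(state: DesignState) -> DesignState:
    # Stub for the planning stage: expand the vague intention into a structured plan.
    state.plan = {
        "background_caption": f"background for: {state.intention}",
        "object_caption": f"main object for: {state.intention}",
        "text": state.intention,
    }
    return state


def background_model(state: DesignState) -> DesignState:
    # Stub for text-to-background generation: records a placeholder layer.
    state.layers.append({"kind": "background", "source": state.plan["background_caption"]})
    return state


def object_model(state: DesignState) -> DesignState:
    # Stub for text-to-object generation, which depends on the earlier background layer.
    background = state.layers[0]
    state.layers.append({
        "kind": "object",
        "source": state.plan["object_caption"],
        "conditioned_on": background["kind"],
    })
    return state


def typography_lmm(state: DesignState) -> DesignState:
    # Stub for typography reasoning: attaches placeholder text attributes.
    state.layers.append({"kind": "text", "content": state.plan["text"], "font": "placeholder-font"})
    return state


# The stage order mirrors the sub-task hierarchy: plan first, then visuals, then text.
PIPELINE: List[Stage] = [design_llm, background_model, object_model, typography_lmm]


def run_pipeline(intention: str) -> DesignState:
    state = DesignState(intention=intention)
    for stage in PIPELINE:
        state = stage(state)
    return state


if __name__ == "__main__":
    result = run_pipeline("design a poster for Hisaishi's concert")
    for layer in result.layers:
        print(layer["kind"])
```

Keeping each stage behind the same `Stage` signature is one plausible way to let a specialized model be swapped or retried without touching the others; the paper itself does not prescribe this interface.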

Paper Structure (15 sections, 22 figures, 10 tables)

This paper contains 15 sections, 22 figures, 10 tables.

Introduction
Related Work
Our Approach
COLE Framework Overview
Design LLM: Intention Recaption and Layout Planning
Text-to-Background Diffusion Model: Visual Planning to Generate Canvas Placeholders
Text-to-Object Diffusion Model: Visual Reasoning based on the Generated Background Image
Typography LMM: Layout Planning and Attribute Reasoning for Visual Text
Multi-Layered SVG Editor and Renderer: Support Layer-wise Flexible User Editing
Reflect LMM & Quality LMM
Experiment
DesignerIntention Benchmark
Main Results
Ablation Experiments
Conclusion

Figures (22)

Figure 1: Multi-layered graphic design images generated by our COLE system (first row; the individual image layers are shown at the top-right corner of each design image) and by combining DALL·E3 background images with the COLE system (second row). See the appendix for detailed intention prompts. As shown in the second row, our COLE system skillfully plans design layouts and selects harmonious fonts, colors, sizes, and positions through insightful analysis and reasoning, even with out-of-domain DALL·E3 background images after pre-processing. By default, we do not use DALL·E3 background images in any other results.
Figure 2: Design images generated by DALL·E3 (augmented with GPT-4), using prompts from our DesignerIntention benchmark.
Figure 3: Comparison with DALL·E3 and CanvaGPT based on a user study.
Figure 4: The detailed hierarchical pipeline of the proposed COLE system. Upon receiving a user's intention, we first use a Design-LLM to translate the intention into a detailed JSON file. This step requires multi-layered layout planning, predicting a wide range of attributes for the required visual elements. Next, we apply a pair of cascaded diffusion models for text-to-background generation and text-to-object (and alpha mask) generation. These models create the visual assets, guided not only by the text instructions specified in the JSON file but also by the need to reason about the spatial relationships between assets to ensure a coherent design. Additionally, we have developed a typography-LMM that predicts the typography JSON file by analyzing and reasoning about the previously predicted text contents, background image, and object image. Finally, we apply a multi-layered SVG editor and rendering system that lets users modify individual layers and then composes and outputs the final image.
Figure 5: An example of generated intention-to-JSON pair data for a given image. The user intention is vague, whereas the JSON file is much more informative. Readers are encouraged to zoom in on the figure for a clearer view. The image originates from our training dataset.
...and 17 more figures
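The final step in the Figure 4 caption, a multi-layered SVG renderer that keeps each layer individually editable, can be illustrated with a minimal sketch. It assumes a hypothetical layer description (dictionaries with keys such as `kind`, `href`, `content`, `font`) and a hypothetical function `render_layers`; it is not the paper's renderer. The only library used is Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

SVG_NS = "http://www.w3.org/2000/svg"


def render_layers(width: int, height: int, layers: List[Dict[str, str]]) -> str:
    """Compose layers into one SVG string, one <g> group per layer.

    The layer dictionary keys are illustrative assumptions, not a schema from the paper.
    """
    svg = ET.Element("svg", {"xmlns": SVG_NS, "width": str(width), "height": str(height)})
    for index, layer in enumerate(layers):
        # A separate group per layer keeps it addressable for later edits.
        group = ET.SubElement(svg, "g", {"id": f"layer-{index}-{layer['kind']}"})
        if layer["kind"] == "text":
            text = ET.SubElement(group, "text", {
                "x": layer["x"],
                "y": layer["y"],
                "font-family": layer["font"],
                "font-size": layer["size"],
                "fill": layer["color"],
            })
            text.text = layer["content"]
        else:
            # Background and object layers are raster images referenced by href.
            ET.SubElement(group, "image", {
                "href": layer["href"],
                "x": "0",
                "y": "0",
                "width": str(width),
                "height": str(height),
            })
    return ET.tostring(svg, encoding="unicode")


if __name__ == "__main__":
    example_layers = [
        {"kind": "background", "href": "background.png"},
        {"kind": "object", "href": "object.png"},
        {"kind": "text", "content": "Concert", "x": "40", "y": "80",
         "font": "placeholder-font", "size": "48", "color": "#ffffff"},
    ]
    print(render_layers(800, 600, example_layers))
    # Editing one layer only requires changing its entry and re-rendering.
    example_layers[2]["color"] = "#ffcc00"
    print(render_layers(800, 600, example_layers))
```

Because the text stays a `<text>` element rather than being rasterized into the image, a user edit to font, size, or color touches one layer and leaves the background and object layers unchanged.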
