PosterCopilot: Toward Layout Reasoning and Controllable Editing for Professional Graphic Design

Jiazhe Wei, Ken Li, Tianyu Lao, Haofan Wang, Liang Wang, Caifeng Shan, Chenyang Si

TL;DR

PosterCopilot addresses the misalignment between geometric layout and aesthetics in Large Multimodal Model (LMM)-driven poster design by decoupling layout reasoning from editing. It introduces a three-stage training pipeline, consisting of Perturbed Supervised Fine-Tuning (PSFT), Reinforcement Learning for Visual-Reality Alignment (RL-VRA), and Reinforcement Learning from Aesthetic Feedback (RLAF), to teach continuous spatial reasoning and human-aligned aesthetics, complemented by a generative agent for asset synthesis and precise layer-wise editing. Empirical results show improved geometric accuracy, style coherence, and editability, outperforming commercial and academic baselines and enabling practical, multi-round professional editing. The work also provides a large-scale, multi-layer poster dataset and a detailed evaluation framework tailored to graphic design quality beyond conventional image metrics.

Abstract

Graphic design forms the cornerstone of modern visual communication, serving as a vital medium for promoting cultural and commercial events. Recent advances have explored automating this process using Large Multimodal Models (LMMs), yet existing methods often produce geometrically inaccurate layouts and lack the iterative, layer-specific editing required in professional workflows. To address these limitations, we present PosterCopilot, a framework that advances layout reasoning and controllable editing for professional graphic design. Specifically, we introduce a progressive three-stage training strategy that equips LMMs with geometric understanding and aesthetic reasoning for layout design, consisting of Perturbed Supervised Fine-Tuning, Reinforcement Learning for Visual-Reality Alignment, and Reinforcement Learning from Aesthetic Feedback. Furthermore, we develop a complete workflow that couples the trained LMM-based design model with generative models, enabling layer-controllable, iterative editing for precise element refinement while maintaining global visual consistency. Extensive experiments demonstrate that PosterCopilot achieves geometrically accurate and aesthetically superior layouts, offering unprecedented controllability for professional iterative design.

Paper Structure

This paper contains 44 sections, 20 equations, 20 figures, 5 tables.

Figures (20)

  • Figure 1: Generated results from our PosterCopilot. PosterCopilot exhibits exceptional graphic design capabilities by creating artworks with professional-grade layout, compelling visuals, and cohesive themes.
  • Figure 2: Failure cases produced by existing design models in real-world, multi-asset scenarios, showing severe misalignments and visual discord.
  • Figure 3: Overview of the training paradigm of PosterCopilot. Rather than formulating the training process as a simple regression task, we endow PosterCopilot with outstanding layout capabilities and human-like aesthetics through a three-stage training paradigm.
  • Figure 4: Geometric instability of text-based coordinate representations. (a) Euclidean Space: The ideal baseline, showing perfect, uniform geometry ($\det(S) \equiv 1$). (b) Text-Based Space: Suffers from signal collapse (near-zero $\det(S)$) and geometric noise, creating a chaotic landscape unstable for optimization. (c) Reconstructed Space via Neighborhood Averaging: This method suppresses noise, recovering a smooth, uniform geometry that is far more stable than (b).
  • Figure 5: Our motivation for visual-reality alignment and aesthetic feedback stems from the observation that design models frequently produce works that violate fundamental graphic design principles, as well as exhibit serious aesthetic flaws. We use red, green, and blue boxes to mark the error areas in the figure.
  • ...and 15 more figures