AesthetiQ: Enhancing Graphic Layout Design via Aesthetic-Aware Preference Alignment of Multi-modal Large Language Models

Sohan Patnaik, Rishabh Jain, Balaji Krishnamurthy, Mausoom Sarkar

TL;DR

This work targets aesthetic-aware graphic layout generation by addressing limitations of cross-entropy losses in layouts. It introduces Aesthetic-Aware Preference Alignment (AAPA), which leverages a judge MLLM to rank multiple candidate layouts and trains the layout predictor via a Direct Preference Optimization–based loss. A key contribution is a data-quality filtering protocol using alignment and overlap heuristics, along with a novel MLLM-based win-rate metric to evaluate aesthetics beyond traditional IoU metrics. Evaluations on Crello and WebUI show substantial gains, with larger models (up to 8B parameters) achieving notable improvements in both geometric accuracy (Mean IoU) and aesthetic alignment (Judge Win Rate), demonstrating the feasibility of integrating aesthetic preferences into multi-modal layout generation.
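
To make the DPO-based objective mentioned above more concrete, the sketch below computes the standard Direct Preference Optimization loss for one (preferred, rejected) layout pair. It assumes the standard DPO formulation; the summary does not state the exact loss used in AAPA, and the function name, argument names, and the `beta` default are illustrative assumptions.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair (illustrative sketch).

    Each argument is the summed log-probability of a layout sequence under
    either the trainable policy or the frozen reference model.
    `beta` is an assumed default, not a value taken from the paper.
    """
    # Log-ratio of policy to reference for the preferred and rejected layouts.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)).
    # For very negative margins, fall back to the asymptote to avoid overflow.
    if margin < -30.0:
        return -margin
    return math.log1p(math.exp(-margin))
```

The loss falls below log 2 when the policy favors the judge-preferred layout more strongly than the reference model does, and rises above it otherwise.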

Abstract

Visual layouts are essential in graphic design fields such as advertising, posters, and web interfaces. The application of generative models for content-aware layout generation has recently gained traction. However, these models fail to understand the contextual aesthetic requirements of layout design and do not align with human-like preferences, primarily treating it as a prediction task without considering the final rendered output. To overcome these problems, we offer Aesthetic-Aware Preference Alignment (AAPA), a novel technique to train a Multi-modal Large Language Model (MLLM) for layout prediction that uses an MLLM's aesthetic preferences for Direct Preference Optimization over graphic layouts. We propose a data filtering protocol utilizing our layout-quality heuristics for AAPA to ensure training happens on high-quality layouts. Additionally, we introduce a novel evaluation metric that uses another MLLM to compute the win rate of the generated layout against the ground-truth layout based on aesthetics criteria. We also demonstrate the applicability of AAPA for MLLMs of varying scales (1B to 8B parameters) and LLM families (Qwen, Phi, InternLM). By conducting thorough qualitative and quantitative analyses, we verify the efficacy of our approach on two challenging benchmarks, Crello and WebUI, showcasing 17% and 16% improvement over current state-of-the-art methods, thereby highlighting the potential of MLLMs in aesthetic-aware layout generation.
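
The win-rate metric described above can be illustrated with a short sketch. It assumes the judge MLLM has already returned one verdict per comparison of a generated layout against its ground truth; the labels `"generated"` and `"ground_truth"` and the function name are hypothetical, and how ties are handled is not specified in this summary.

```python
def judge_win_rate(verdicts):
    """Fraction of comparisons where the generated layout was preferred.

    `verdicts` holds one label per test sample: "generated" if the judge
    preferred the predicted layout, "ground_truth" otherwise. The label
    strings are placeholders chosen for this sketch.
    """
    verdicts = list(verdicts)
    if not verdicts:
        raise ValueError("at least one verdict is required")
    wins = sum(1 for v in verdicts if v == "generated")
    return wins / len(verdicts)
```

For example, `judge_win_rate(["generated", "ground_truth", "generated", "generated"])` counts three wins out of four comparisons.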

Paper Structure

This paper contains 20 sections, 11 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Existing cross-entropy-loss-based methods penalize element misalignment heavily, while preferential tuning via AAPA better captures aesthetic nuances in layouts.
  • Figure 2: The training for the aesthetic layout prediction task consists of the following steps: 1) Vision Encoder: Design elements (images and text) are processed to generate image and text embeddings. 2) AesthetiQ Model Prediction: Embeddings are passed to the AesthetiQ model, which predicts layout coordinates. 3) Training with Cross-Entropy Loss: The predicted layout is compared with the ground truth and trained using cross-entropy loss. 4) Sampling for Comparison: Multiple layout predictions are generated using AesthetiQ inference. 5) Pair Selection and Quality Filtering: We filter the data based on quality heuristics to ensure layout quality in samples. 6) Judging by ViLA: The ViLA model compares layout pairs and selects the better one based on aesthetic preferences. 7) Aesthetic Preference Optimization (AAPA): Feedback from ViLA is used to fine-tune the AesthetiQ model for aesthetic optimization.
  • Figure 3: Qualitative comparison of our model, AesthetiQ, against recent methods FlexDM, LACE, and LayoutNUWA. Despite the challenge of arranging numerous elements, AesthetiQ consistently achieves superior layout quality. In row (a), AesthetiQ effectively places text within salient regions, maintaining clear hierarchy and avoiding overlaps, which enhances readability and aesthetic appeal. In row (b), it achieves precise alignment across elements and optimally positions diverse shapes, preserving a cohesive visual structure. Row (c) showcases AesthetiQ's advanced semantic understanding, generating a visually balanced and aesthetically pleasing layout. Overall, AesthetiQ consistently outperforms competitors in creating coherent, well-structured designs that align with human aesthetic preferences.
  • Figure 4: Performance improvement across scale (1B–8B parameters) for layout generation, showing effects of pretraining, quality filtering, and Aesthetic-Aware Preference Alignment (AAPA). Left: IoU progression under different training configurations. Middle: $\mathcal{M}_{\text{judge}}$ Win Rate improvements, emphasizing the impact of AAPA and pretraining. Right: Configuration table indicating settings for each experiment. The results underscore the impact of each design component in AesthetiQ, emphasizing their role in tackling layout generation challenges.
  • Figure 5: Capability of AesthetiQ to generate templates in various aspect ratios by changing the canvas height and width
  • ...and 2 more figures
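
Steps 4 to 6 of the Figure 2 pipeline (sampling, quality filtering, and judging) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: boxes are assumed to be `(x, y, width, height)` tuples, only an overlap heuristic is shown (the alignment heuristic is omitted), the `max_overlap` threshold is arbitrary, and `judge` is a hypothetical callable standing in for the ViLA judge.

```python
from itertools import combinations


def box_area(box):
    """Area of an (x, y, width, height) box; negative sizes count as zero."""
    _, _, w, h = box
    return max(w, 0.0) * max(h, 0.0)


def intersection_area(a, b):
    """Area of the overlap between two (x, y, width, height) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    return max(iw, 0.0) * max(ih, 0.0)


def overlap_score(layout):
    """Mean over element pairs of intersection area / smaller box area."""
    pairs = list(combinations(layout, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        smaller = min(box_area(a), box_area(b))
        if smaller > 0:
            total += intersection_area(a, b) / smaller
    return total / len(pairs)


def build_preference_pairs(candidates, judge, max_overlap=0.2):
    """Filter sampled layouts, then let the judge order each remaining pair.

    `judge(a, b)` is a stand-in for the judge MLLM: it returns 0 if layout
    `a` is preferred and 1 if layout `b` is preferred. `max_overlap` is an
    assumed threshold, not a value from the paper.
    """
    kept = [c for c in candidates if overlap_score(c) <= max_overlap]
    pairs = []
    for a, b in combinations(kept, 2):
        chosen, rejected = (a, b) if judge(a, b) == 0 else (b, a)
        pairs.append((chosen, rejected))
    return pairs
```

Each resulting `(chosen, rejected)` tuple is the kind of pair a preference-optimization loss would consume in step 7.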