DesignSense: A Human Preference Dataset and Reward Modeling Framework for Graphic Layout Generation

Varun Gopal, Rishabh Jain, Aradhya Mathur, Nikitha SR, Sohan Patnaik, Sudhir Yarram, Mayur Hemani, Balaji Krishnamurthy, Mausoom Sarkar

TL;DR

This work introduces DesignSense-10k, a large-scale dataset of 10,235 human-annotated preference pairs for graphic layout evaluation, and trains DesignSense, a vision-language model-based classifier that substantially outperforms existing open-source and proprietary models across comprehensive evaluation metrics.

Abstract

Graphic layouts serve as an important and engaging medium for visual communication across different channels. While recent layout generation models have demonstrated impressive capabilities, they frequently fail to align with nuanced human aesthetic judgment. Existing preference datasets and reward models trained on text-to-image generation do not generalize to layout evaluation, where the spatial arrangement of identical elements determines quality. To address this critical gap, we introduce DesignSense-10k, a large-scale dataset of 10,235 human-annotated preference pairs for graphic layout evaluation. We propose a five-stage curation pipeline that generates visually coherent layout transformations across diverse aspect ratios, using semantic grouping, layout prediction, filtering, clustering, and VLM-based refinement to produce high-quality comparison pairs. Human preferences are annotated using a 4-class scheme (left, right, both good, both bad) to capture subjective ambiguity. Leveraging this dataset, we train DesignSense, a vision-language model-based classifier that substantially outperforms existing open-source and proprietary models across comprehensive evaluation metrics (54.6% improvement in Macro F1 over the strongest proprietary baseline). Our analysis shows that frontier VLMs remain unreliable overall and fail catastrophically on the full four-class task, underscoring the need for specialized, preference-aware models. Beyond the dataset, our reward model DesignSense yields tangible downstream gains in layout generation. Using our judge during RL-based training improves generator win rate by about 3%, while inference-time scaling, which involves generating multiple candidates and selecting the best one, provides a 3.6% improvement. These results highlight the practical impact of specialized, layout-aware preference modeling on real-world layout generation quality.
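
The abstract reports results in Macro F1 over the four annotation classes. As a rough illustration of why that metric is informative here, the following minimal sketch computes Macro F1 as the unweighted mean of per-class F1 scores. The label strings, the function name `macro_f1`, and the convention of scoring a class with no true or predicted instances as 0 are assumptions made for this example; they are not taken from the paper's evaluation code.

```python
# Hypothetical label names for the 4-class scheme described in the abstract.
CLASSES = ("left", "right", "both_good", "both_bad")


def macro_f1(y_true, y_pred, classes=CLASSES):
    """Return the unweighted mean of per-class F1 scores.

    A class with no true and no predicted instances contributes 0.0
    (an assumption of this sketch, not a convention stated in the paper).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    f1_scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN), which equals the harmonic mean of
        # precision and recall whenever both are defined.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(classes)


if __name__ == "__main__":
    truth = ["left", "right", "both_good", "both_bad"]
    always_left = ["left"] * 4
    accuracy = sum(t == p for t, p in zip(truth, always_left)) / len(truth)
    print("accuracy:", accuracy)
    print("macro F1:", macro_f1(truth, always_left))
```

In this toy case a judge that always answers "left" is right on one pair in four, yet its Macro F1 is much lower than its accuracy because three of the four classes score zero. This is the kind of collapse onto a subset of classes that an averaged per-class metric is designed to expose.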

Paper Structure (29 sections, 2 equations, 3 figures, 4 tables)

This paper contains 29 sections, 2 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Examples of predicted preference ordering by frontier vision-language models and our DesignSense model trained with the DesignSense-10k dataset. Each layout pair is evaluated through a 4-class annotation protocol: "Left," "Right," "Both Good," and "Both Bad." The right panel summarizes model agreement with human judgments. (fs = few-shot; deep = deep thinking)
  • Figure 2: Overview of the DesignSense data curation pipeline, illustrated in five main steps: Step 1: Grouping - Elements from the original layout are grouped based on semantic and spatial relationships to reduce structural complexity and preserve design intent. Step 2: Prediction - Grouped elements are fed into a layout prediction model to generate multiple candidate relayouts under diverse aspect ratio conditions. Step 3: Clustering & Filtering - Generated layouts are clustered to maximize output diversity and filtered to retain only high-quality candidates, selecting the top three most distinct layouts for each setting. Step 4: Refinement - Selected layouts are further improved using a refinement module which optimizes element positions, resolves overlaps, and enhances overall visual alignment. This end-to-end process enables the construction of a large-scale, diverse, and preference-annotated graphic layout dataset.
  • Figure 3: Dataset statistics and annotation analysis for DesignSense. Top left: Distribution of image aspect ratios ($\log_2(\text{width}/\text{height})$), illustrating the diversity of layouts. Top right: Pie charts showing the distribution of relayout settings (“Stretching_2x,” “Reverse Ratio,” “Original Ratio”) and annotation result classes (“Both Bad,” “Both Good,” “Left,” “Right”). Bottom left: Histograms of the number of groups per sample and the number of elements per sample, highlighting compositional complexity. Bottom right: Inter-human agreement, showing substantially higher consistency among our annotators than random choice in both the 4-class and 2-class settings.
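
Figure 3 plots aspect ratios on a $\log_2(\text{width}/\text{height})$ scale. A small sketch of that transform is given below; the function name `log2_aspect_ratio` and the sample dimensions are illustrative assumptions rather than values from the dataset. On this scale a square layout maps to 0, landscape layouts are positive, portrait layouts are negative, and swapping width and height only flips the sign, so the axis is symmetric around square.

```python
import math


def log2_aspect_ratio(width, height):
    """Return log2(width / height) for a layout with positive dimensions."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return math.log2(width / height)


if __name__ == "__main__":
    # Illustrative canvas sizes in pixels (hypothetical, not from the dataset).
    for w, h in [(1080, 1080), (2000, 1000), (1000, 2000)]:
        print(f"{w}x{h}: {log2_aspect_ratio(w, h):+.2f}")
```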