DesignPref: Capturing Personal Preferences in Visual Design Generation

Yi-Hao Peng, Jeffrey P. Bigham, Jason Wu

TL;DR

This work addresses subjectivity in visual UI design evaluation by introducing DesignPref, a dataset of 12,000 designer-annotated pairwise comparisons with multi-level ratings and rationales. It documents substantial inter-designer disagreement and analyzes the reasons behind divergent judgments, which enables identity-aware modeling. Using CLIP finetuning with a strength-aware margin and retrieval-augmented generation, the authors show that personalized models outperform aggregated baselines at predicting individual designers' preferences, even with 20 times fewer examples. The findings suggest that encoding designer identity and preference strength can improve automated UI design assessment and support personalized design generation and evaluation workflows.
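
To make the idea of a strength-aware margin concrete, here is a minimal sketch of a hinge-style ranking loss whose margin scales with how strongly a designer preferred one design. This summary does not state the paper's exact loss, so the function name, the `base_margin` value, and the encoding of strength as 1 (slight) or 2 (strong) are all assumptions made for illustration.

```python
def strength_aware_margin_loss(score_preferred, score_other, strength, base_margin=0.5):
    """Hinge-style pairwise ranking loss with a strength-dependent margin.

    score_preferred / score_other: scalar scores a model assigns to the
    preferred and the non-preferred design.
    strength: hypothetical encoding, 1 = slightly preferred, 2 = strongly preferred.
    base_margin: hypothetical hyperparameter, not taken from the paper.
    """
    # A stronger stated preference demands a larger score gap.
    margin = base_margin * strength
    gap = score_preferred - score_other
    # Zero loss once the gap meets the required margin.
    return max(0.0, margin - gap)


if __name__ == "__main__":
    # The same score gap can satisfy a slight preference but not a strong one.
    print(strength_aware_margin_loss(1.0, 0.2, strength=1))
    print(strength_aware_margin_loss(1.0, 0.2, strength=2))
```

Under this sketch, a pair labeled as a strong preference keeps contributing loss until the model separates the two designs by a wider gap than a pair labeled as a slight preference.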

Abstract

Generative models, such as large language models and text-to-image diffusion models, are increasingly used to create visual designs like user interfaces (UIs) and presentation slides. Finetuning and benchmarking these generative models have often relied on datasets of human-annotated design preferences. Yet, due to the subjective and highly personalized nature of visual design, preferences vary widely among individuals. In this paper, we study this problem by introducing DesignPref, a dataset of 12k pairwise comparisons of generated UI designs annotated by 20 professional designers with multi-level preference ratings. We find that even among trained designers, substantial disagreement exists (Krippendorff's alpha = 0.25 for binary preferences). Natural language rationales provided by these designers indicate that disagreements stem from differing views on the relative importance of design aspects and from individual preferences. With DesignPref, we demonstrate that traditional majority-voting methods for training aggregated judge models often do not accurately reflect individual preferences. To address this challenge, we investigate multiple personalization strategies, particularly fine-tuning on designer-specific annotations or incorporating them into RAG pipelines. Our results show that personalized models consistently outperform aggregated baseline models in predicting individual designers' preferences, even when using 20 times fewer examples. Our work provides the first dataset to study personalized visual design evaluation and supports future research into modeling individual design taste.
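
The gap between a majority-vote label and an individual's preference can be illustrated with a small sketch. The data layout (a dict from pair ID to per-designer binary votes), the designer IDs, and the toy votes below are hypothetical and are not drawn from DesignPref. The majority here includes the rater's own vote, which is one of several reasonable choices.

```python
from collections import Counter


def majority_label(votes):
    """Return the most common label, or None when the top two labels tie."""
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def individual_match_rate(annotations, rater):
    """Fraction of a rater's votes that agree with the majority label.

    annotations: dict mapping pair_id -> {rater_id: "A" or "B"} (toy format).
    Pairs the rater did not label, and tied pairs, are skipped.
    """
    matched = 0
    total = 0
    for votes in annotations.values():
        if rater not in votes:
            continue
        majority = majority_label(list(votes.values()))
        if majority is None:
            continue
        total += 1
        if votes[rater] == majority:
            matched += 1
    return matched / total if total else float("nan")


# Hypothetical votes from three designers on three design pairs.
toy_annotations = {
    "pair1": {"d1": "A", "d2": "A", "d3": "B"},
    "pair2": {"d1": "B", "d2": "A", "d3": "B"},
    "pair3": {"d1": "A", "d2": "B", "d3": "B"},
}

if __name__ == "__main__":
    for designer in ("d1", "d2", "d3"):
        print(designer, individual_match_rate(toy_annotations, designer))
```

In this toy example every designer disagrees with the majority on one of three pairs, so a judge trained only on majority labels would mispredict each individual a third of the time even if it fit the aggregate perfectly.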

Paper Structure

This paper contains 35 sections, 3 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Our annotation interface includes a prompt that defines the design task. Two variants generated from the prompt appear side by side. The rater selects one of four preference options.
  • Figure 2: Cohen’s kappa agreement for pairwise binary preferences across designers.
  • Figure 3: Embedding clusters for preference rationale themes.
  • Figure 4: Rationales behind divergent preferences. Each pair shows an example where half of the designers chose A and half chose B.
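
For readers unfamiliar with the agreement statistic named in Figure 2, the following sketch computes Cohen's kappa for two raters' binary preferences using the standard definition (observed agreement corrected for chance agreement). The label lists are hypothetical; how the paper preprocesses its ratings before computing kappa is not described in this summary.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items.

    labels_a, labels_b: equal-length lists of categorical labels,
    e.g. "A" or "B" for which design in a pair was preferred.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label rates.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        # Both raters used a single identical label; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical binary preferences from two designers on four pairs.
    rater_1 = ["A", "A", "B", "B"]
    rater_2 = ["A", "B", "B", "B"]
    print(cohens_kappa(rater_1, rater_2))
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance, so values near the low end across designer pairs would be consistent with the disagreement the abstract reports.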