Visual Prompting with Iterative Refinement for Design Critique Generation

Peitong Duan, Chin-Yi Cheng, Bjoern Hartmann, Yang Li

TL;DR

This paper tackles automated UI design critique by introducing a modular prompting pipeline of six LLMs that iteratively refines both the critique text and its bounding boxes, guided by visual prompts and zoomed-in patches. It improves grounding and critique quality on the UI critique task (UICrit) with both Gemini-1.5-pro and GPT-4o: human experts generally preferred its outputs over the baseline's, and the pipeline reduced the gap to human performance by 50% on one rating metric. The approach also generalizes to open-vocabulary object and attribute detection, improving mAP relative to a baseline, though it does not surpass fine-tuned models. Overall, the method offers a practical strategy for producing visually grounded, actionable feedback in multimodal design tasks and related grounding problems.

Abstract

Feedback is crucial for every design process, such as user interface (UI) design, and automating design critiques can significantly improve the efficiency of the design workflow. Although existing multimodal large language models (LLMs) excel in many tasks, they often struggle with generating high-quality design critiques -- a complex task that requires producing detailed design comments that are visually grounded in a given design's image. Building on recent advancements in iterative refinement of text output and visual prompting methods, we propose an iterative visual prompting approach for UI critique that takes an input UI screenshot and design guidelines and generates a list of design comments, along with corresponding bounding boxes that map each comment to a specific region in the screenshot. The entire process is driven completely by LLMs, which iteratively refine both the text output and bounding boxes using few-shot samples tailored for each step. We evaluated our approach using Gemini-1.5-pro and GPT-4o, and found that human experts generally preferred the design critiques generated by our pipeline over those by the baseline, with the pipeline reducing the gap from human performance by 50% for one rating metric. To assess the generalizability of our approach to other multimodal tasks, we applied our pipeline to open-vocabulary object and attribute detection, and experiments showed that our method also outperformed the baseline.
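To make the task's output shape concrete, here is a minimal Python sketch of a design comment paired with a bounding box, plus a generic refinement loop. Everything in it is an assumption for illustration and is not taken from the paper: the `Critique` type, the normalized `(x_min, y_min, x_max, y_max)` box format, the `refine_step` callable (which stands in for an LLM call), and the stopping rule based on `max_iters` and `tol`.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max), normalized to [0, 1].
BBox = Tuple[float, float, float, float]


@dataclass
class Critique:
    """One design comment grounded to a region of the screenshot."""
    comment: str
    bbox: BBox


def clamp_bbox(bbox: BBox) -> BBox:
    """Clip coordinates to [0, 1] and order them so min <= max on each axis."""
    x0, y0, x1, y1 = (min(max(v, 0.0), 1.0) for v in bbox)
    return (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))


def refine_critique(
    critique: Critique,
    refine_step: Callable[[str, BBox], BBox],
    max_iters: int = 3,
    tol: float = 1e-3,
) -> Critique:
    """Repeatedly ask `refine_step` for a better box.

    `refine_step` is a hypothetical stand-in for a model call. The loop
    stops when the box changes by less than `tol` on every coordinate,
    or after `max_iters` proposals (an assumed stopping rule).
    """
    bbox = clamp_bbox(critique.bbox)
    for _ in range(max_iters):
        proposal = clamp_bbox(refine_step(critique.comment, bbox))
        converged = max(abs(a - b) for a, b in zip(proposal, bbox)) < tol
        bbox = proposal
        if converged:
            break
    return Critique(critique.comment, bbox)
```

In use, `refine_step` would wrap whatever model call proposes an updated box for a given comment; passing a deterministic stub instead makes the loop easy to test in isolation.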

Paper Structure

This paper contains 31 sections, 18 figures, 4 tables, 1 algorithm.

Figures (18)

  • Figure 1: Illustration of the UI Design Critique Task, which takes in a UI screenshot and a set of design guidelines and outputs a list of design comments with corresponding bounding boxes (Bbox).
  • Figure 2: Our prompting pipeline, which takes an image and a task prompt as input and outputs text items with their corresponding bounding boxes on the image. The pipeline consists of six distinct LLMs, organized into three modules: Text Generation and Refinement, Validation, and Bounding Box (Bbox) Generation and Refinement. Targeted few-shot examples are provided for each LLM. The main inputs and outputs for each LLM are shown, and Section \ref{sec:pipeline} details all the inputs, outputs, and few-shot examples for each LLM. Each input/output is numbered by its order of generation, and numbers with a '+' indicate multiple iterations of that input/output.
  • Figure 3: An example of the inputs to the Bounding Box Refinement LLM.
  • Figure 4: Illustration of the Open Vocabulary Object and Attribute Detection Task. The example output is taken from Bravo_2023_CVPR.
  • Figure 5: Illustration of four example outputs from the pipeline. The screenshots are marked with the output bounding boxes, and the generated comments are shown, each pointing to its corresponding bounding box. Helpful comments with reasonably accurate bounding boxes are highlighted in green.
  • ...and 13 more figures