Leveraging Multimodal LLM for Inspirational User Interface Search

Seokhyeon Park, Yumin Song, Soohyun Lee, Jaeyoung Kim, Jinwook Seo

TL;DR

The paper tackles the challenge of semantically rich inspirational UI search by eliminating reliance on metadata and pixel similarity. It introduces a pipeline that uses a multimodal large language model to extract rich UI semantics directly from mobile UI screenshots and assembles them into a semantic-based retrieval system called S&UI. Through computational evaluations on UI datasets and extensive human studies with designers, the authors demonstrate that semantic extraction plus S&UI outperforms traditional pixel-/metadata-based baselines in relevance, reliability, usefulness, diversity, and serendipity. The work advances UI design tooling by enabling context-aware, explainable inspiration and provides a public S&UI dataset to accelerate future research in semantic UI understanding and retrieval.

Abstract

Inspirational search, the process of exploring designs to inform and inspire new creative work, is pivotal in mobile user interface (UI) design. However, exploring the vast space of UI references remains a challenge. Existing AI-based UI search methods often miss crucial semantics like target users or the mood of apps. Additionally, these models typically require metadata like view hierarchies, limiting their practical use. We used a multimodal large language model (MLLM) to extract and interpret semantics from mobile UI images. We identified key UI semantics through a formative study and developed a semantic-based UI search system. Through computational and human evaluations, we demonstrate that our approach significantly outperforms existing UI retrieval methods, offering UI designers a more enriched and contextually relevant search experience. We enhance the understanding of mobile UI design semantics and highlight MLLMs' potential in inspirational search, providing a rich dataset of UI semantics for future studies.

Paper Structure

This paper contains 60 sections, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Illustration of prompt and semantic output for mobile UI semantic extraction using multimodal LLM. The left panel displays our prompt concept (the mobile UI screenshot and additional prompts, i.e., assistant persona, instruction, feature list, feature definition & instruction, and response form), and the right panel presents the structured YAML-formatted output detailing key semantic attributes extracted by GPT-4o.
  • Figure 2: The query and retrieval method, illustrating how user-specified semantics and weights are processed to retrieve relevant UI designs from the database. The system computes cosine similarity scores between the query and the designs, applying user-adjusted semantic weights to prioritize certain semantics.
  • Figure 3: The S&UI system interface. The system enables designers to search UI screens using key UI semantics, such as app category, mood, and screen role. The Search Panel allows the addition of semantic descriptions and weight adjustment according to user needs. The Result Panel displays relevant screens retrieved based on semantic queries. In the Screen Detail Panel, designers can explore detailed screen semantics, iterating and refining their search with the Import feature. The Find Next & Previous functionality helps find screens based on estimated user flows, enhancing designers' contextual understanding during the design inspiration process.
  • Figure 4: Comparison of Correct Screen and App Category Predictions: The top chart compares GPT-4o and GUIClip across 20 screen categories, while the bottom chart contrasts GPT-4o and CLIP over 31 app categories. Top-1 predictions are highlighted, with top-3 shown semi-transparently. GPT-4o outperforms baseline methods in both cases.
  • Figure 5: Box plots comparing the syntactic dependency complexity (left) and POS diversity scores of screen descriptions (right) from the Screen2Words dataset (average and best) and GPT-4o. GPT-4o generates more complex and diverse descriptions than the Screen2Words dataset, as indicated by the higher scores in both metrics.
  • ...and 4 more figures
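
The retrieval step described in the Figure 2 caption (cosine similarity between query and design, adjusted by user-set semantic weights) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy two-dimensional vectors, and the choice to combine per-semantic similarities as a weight-normalized average are all assumptions, since the caption does not specify the embedding model or the exact weighting formula.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def weighted_score(query, screen, weights):
    """Weight-normalized average of per-semantic cosine similarities.

    ASSUMPTION: the normalized weighted average is one plausible way to
    apply user-adjusted weights; the paper's formula may differ.
    """
    num, den = 0.0, 0.0
    for name, q_vec in query.items():
        if name not in screen:
            continue  # skip semantics the stored screen does not have
        w = weights.get(name, 1.0)  # unweighted semantics default to 1.0
        num += w * cosine(q_vec, screen[name])
        den += w
    return num / den if den > 0 else 0.0


def search(query, database, weights, top_k=3):
    """Return up to top_k (screen_id, score) pairs, best match first."""
    scored = [(sid, weighted_score(query, sem, weights))
              for sid, sem in database.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings; real vectors would come from a text encoder.
database = {
    "screen_a": {"mood": [1.0, 0.0], "screen_role": [0.0, 1.0]},
    "screen_b": {"mood": [0.0, 1.0], "screen_role": [1.0, 0.0]},
}
query = {"mood": [1.0, 0.0], "screen_role": [1.0, 0.0]}

mood_first = search(query, database, {"mood": 3.0, "screen_role": 1.0})
role_first = search(query, database, {"mood": 1.0, "screen_role": 3.0})
print(mood_first[0][0], role_first[0][0])
```

In this toy setup, raising the weight on one semantic changes which screen ranks first, which is the behavior the caption attributes to user-adjusted weights.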