UISearch: Graph-Based Embeddings for Multimodal Enterprise UI Screenshots Retrieval

Maroun Ayli, Youssef Bakouny, Tushar Sharma, Nader Jalloul, Hani Seifeddine, Rima Kilany

TL;DR

This work tackles the retrieval of enterprise UI screens by introducing UISearch, a search framework built on a graph-based representation that converts screenshots into attributed graphs capturing hierarchical structure and spatial relationships. A contrastive graph autoencoder learns structure-aware embeddings that discriminate between screens more finely than the vision-language models it is compared against. A hybrid indexing system combines FAISS-based vector search with metadata filtering to support complex multimodal queries with sub-second latency on ~20K UIs. The approach delivers strong retrieval accuracy across semantic, structural, and metadata modalities, and shows that structure-aware UI search can be deployed at scale in real-world enterprise settings.
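As a rough illustration of the hybrid idea described above (metadata filtering combined with vector similarity), the following minimal sketch filters candidates by exact-match metadata and then ranks the survivors by cosine similarity. It is not the authors' implementation: the `Screen` record, the `hybrid_search` function, and the metadata keys are hypothetical, and a brute-force pure-Python scan stands in for the FAISS index mentioned in the summary.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Screen:
    """Hypothetical index record: an embedding plus filterable metadata."""
    screen_id: str
    embedding: List[float]
    metadata: Dict[str, str] = field(default_factory=dict)


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_search(
    index: List[Screen],
    query_vec: List[float],
    filters: Dict[str, str],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Filter by exact-match metadata, then rank survivors by cosine similarity.

    Assumption: a brute-force scan replaces the approximate vector index a
    production system would use.
    """
    candidates = [
        s for s in index
        if all(s.metadata.get(key) == value for key, value in filters.items())
    ]
    scored = [(s.screen_id, cosine(query_vec, s.embedding)) for s in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy index with made-up 2-D embeddings and a made-up "product" field.
toy_index = [
    Screen("login_v1", [1.0, 0.0], {"product": "banking"}),
    Screen("login_v2", [0.9, 0.1], {"product": "banking"}),
    Screen("dashboard", [0.0, 1.0], {"product": "banking"}),
    Screen("login_other", [1.0, 0.0], {"product": "trading"}),
]
top = hybrid_search(toy_index, [1.0, 0.0], {"product": "banking"}, k=2)
```

Filtering before ranking keeps the similarity computation restricted to screens that can satisfy the query at all; whether UISearch filters before or after the vector lookup is not stated in this summary.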

Abstract

Enterprise software companies maintain thousands of user interface screens across products and versions, creating critical challenges for design consistency, pattern discovery, and compliance checking. Existing approaches rely on visual similarity or text semantics and lack explicit modeling of the structural properties fundamental to user interface (UI) composition. We present a novel graph-based representation that converts UI screenshots into attributed graphs encoding hierarchical relationships and spatial arrangements, potentially generalizable to document layouts, architectural diagrams, and other structured visual domains. A contrastive graph autoencoder learns embeddings that preserve multi-level similarity across visual, structural, and semantic properties. Our analysis demonstrates that these structural embeddings achieve better discriminative power than state-of-the-art vision encoders, representing a fundamental advance in the expressiveness of UI representations. We implement this representation in UISearch, a multimodal search framework that combines structural embeddings with semantic search through a composable query language. On 20,396 financial software UIs, UISearch achieves 0.92 Top-5 accuracy with 47.5 ms median latency (P95: 124 ms), scaling to 20,000+ screens. The hybrid indexing architecture enables complex queries and supports fine-grained UI distinctions that are impossible with vision-only approaches.
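To make the notion of an attributed graph with hierarchical and spatial edges more concrete, here is a minimal sketch under stated assumptions. It takes already-detected UI elements with bounding boxes, derives "contains" edges from box containment, and adds "left_of" / "above" edges between siblings. The `Element` type, the edge labels, and the containment heuristic are illustrative choices; the abstract does not specify the paper's actual node attributes or edge types.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), origin at top-left


@dataclass(frozen=True)
class Element:
    """Hypothetical detected UI element."""
    name: str
    kind: str
    box: Box


def area(box: Box) -> float:
    return (box[2] - box[0]) * (box[3] - box[1])


def contains(outer: Box, inner: Box) -> bool:
    """True if `outer` fully encloses `inner` and the boxes differ."""
    return (
        outer != inner
        and outer[0] <= inner[0] and outer[1] <= inner[1]
        and outer[2] >= inner[2] and outer[3] >= inner[3]
    )


def build_graph(elements: List[Element]):
    """Return (nodes, edges) for an attributed UI graph.

    Assumed heuristics: the parent of an element is its smallest enclosing
    box; spatial edges are only added between elements sharing a parent.
    """
    parent: Dict[str, Optional[str]] = {}
    for elem in elements:
        enclosing = [c for c in elements if contains(c.box, elem.box)]
        parent[elem.name] = (
            min(enclosing, key=lambda c: area(c.box)).name if enclosing else None
        )

    # Hierarchical edges: parent -> child.
    edges = [(p, child, "contains") for child, p in parent.items() if p is not None]

    # Spatial edges between siblings.
    for a in elements:
        for b in elements:
            if a is b or parent[a.name] != parent[b.name]:
                continue
            if a.box[2] <= b.box[0]:
                edges.append((a.name, b.name, "left_of"))
            elif a.box[3] <= b.box[1]:
                edges.append((a.name, b.name, "above"))

    nodes = {e.name: {"kind": e.kind, "box": e.box} for e in elements}
    return nodes, edges


# Toy screen: a form holding a label, an input, and a button.
toy_elements = [
    Element("form", "container", (0, 0, 100, 100)),
    Element("label", "text", (10, 10, 40, 20)),
    Element("input", "textbox", (50, 10, 90, 20)),
    Element("button", "button", (10, 30, 40, 40)),
]
nodes, edges = build_graph(toy_elements)
```

A graph encoder would consume `nodes` and `edges` as its input; the contrastive autoencoder itself is beyond what this sketch covers.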

Paper Structure

This paper contains 20 sections, 7 equations, 2 figures, 4 tables, 1 algorithm.

Figures (2)

  • Figure 1: UISearch system architecture showing the complete pipeline from UI screenshots through graph construction, contrastive learning, and hybrid indexing to ranked search results.
  • Figure 2: UISearch structural embeddings (green) maintain superior image discrimination with broad cosine similarity distributions (0.0–0.4), while vision encoders (CLIP variants, SigLIP, DINOv2) exhibit representation collapse with narrow peaks near 1.0.
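The kind of comparison described in the Figure 2 caption can be approximated by computing the cosine similarity of every pair of embeddings and summarizing the resulting distribution: a mean close to 1.0 with small spread suggests collapsed, poorly discriminative embeddings. The sketch below is only an illustration with made-up vectors; the function name `pairwise_similarity_stats` is hypothetical, and the paper's exact evaluation protocol is not given in this summary.

```python
import math
from itertools import combinations
from statistics import mean, pstdev
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pairwise_similarity_stats(embeddings: List[List[float]]) -> Dict[str, float]:
    """Summarize cosine similarity over all unordered pairs of embeddings."""
    sims = [cosine(a, b) for a, b in combinations(embeddings, 2)]
    return {
        "mean": mean(sims),
        "std": pstdev(sims),
        "min": min(sims),
        "max": max(sims),
    }


# Made-up examples: orthogonal vectors versus nearly identical vectors.
spread = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
collapsed = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.1], [1.1, 1.0, 1.0]]

spread_stats = pairwise_similarity_stats(spread)
collapsed_stats = pairwise_similarity_stats(collapsed)
```

In this toy data the orthogonal set has a mean pairwise similarity of zero, while the nearly identical set clusters just below 1.0, which is the pattern the caption attributes to the compared vision encoders.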