Visual Embedding of Screen Sequences for User-Flow Search in Example-driven Communication

Daeheon Jeong, Hyehyun Chu

TL;DR

The paper tackles the challenge of effectively communicating UX considerations to diverse stakeholders by leveraging example-driven evidence in the form of user-flow screen sequences. It introduces a semantic embedding method that encodes sequences of screens using a ViT-based visual encoder and a one-layer multi-head attention pooler, trained with a symmetric contrastive loss to align with textual task descriptions: $\mathcal{L}_{contrast} = \tfrac{1}{2}[\mathrm{CE}(\mathbf{V}\mathbf{T}^{\mathsf{T}}/\tau) + \mathrm{CE}(\mathbf{T}\mathbf{V}^{\mathsf{T}}/\tau)]$, where $\tau=0.07$. Grounded in a formative study with four UX practitioners and a subsequent human-subject survey (n=21), the approach demonstrates that sequence-based retrieval is more aligned with human perceptions of relevance than text-only baselines and offers practical value as a design reference. The work contributes (1) an in-depth interview study on how practitioners communicate UX considerations, (2) a novel screen-sequence embedding method for semantic search, and (3) empirical evidence supporting integration into design workflows, with implications for computational representations of user flows.
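The symmetric loss above can be illustrated with a minimal pure-Python sketch. It assumes that row $i$ of $\mathbf{V}$ (pooled screen-sequence embeddings) is paired with row $i$ of $\mathbf{T}$ (text embeddings), so the cross-entropy targets lie on the diagonal, and that both sets of embeddings are L2-normalized before the dot product. Neither detail is stated in this summary, and the function names are hypothetical rather than the authors' code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumed preprocessing, not from the paper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(V, T, tau=0.07):
    """0.5 * [CE(V T^T / tau) + CE(T V^T / tau)] for paired rows of V and T."""
    V = [l2_normalize(v) for v in V]
    T = [l2_normalize(t) for t in T]
    # Sequence-to-text logits: entry (i, j) compares sequence i with text j.
    seq_to_text = [[sum(a * b for a, b in zip(v, t)) / tau for t in T] for v in V]
    # Text-to-sequence logits are the transpose of the matrix above.
    text_to_seq = [list(col) for col in zip(*seq_to_text)]
    return 0.5 * (cross_entropy_diagonal(seq_to_text)
                  + cross_entropy_diagonal(text_to_seq))
```

With this sketch, correctly matched pairs give a loss near zero, while swapping the text rows so that every pair is mismatched gives a large loss. The small temperature $\tau=0.07$ sharpens that contrast.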

Abstract

Effective communication of UX considerations to stakeholders (e.g., designers and developers) is a critical challenge for UX practitioners. To explore this problem, we interviewed four UX practitioners about their communication challenges and strategies. Our study identifies that providing an example user flow (a screen sequence representing a semantic task) as evidence reinforces communication, yet finding relevant examples remains challenging. To address this, we propose a method to systematically retrieve user flows using semantic embedding. Specifically, we design a model that learns to associate screens' visual features with user flow descriptions through contrastive learning. A survey confirms that our approach retrieves user flows better aligned with human perceptions of relevance. We analyze the results and discuss implications for the computational representation of user flows.

Paper Structure

This paper contains 27 sections, 1 equation, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Our method consists of three main components: (A) representing screen sequences as a series of visual features using ViT; (B) aggregating variable-length visual features using one-layer multi-head attention pooling with temporal encoding and masking; and (C) training the model on contrastive loss between screen sequence–text pairs.
  • Figure 2: Box plots comparing similarity scores between our sequence-based search model and the baseline model ($n = 21$). The plot illustrates median scores, quartiles, and individual data points for both models. A paired t-test revealed significant differences between the models ($p < 0.01$).
  • Figure 3: Example search result sequences retrieved for each scenario in Task 2. Scenario 1 focuses on search result pages, Scenario 2 on product detail pages, and Scenario 3 on multi-tab interfaces.
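Component (B) in Figure 1, pooling a variable-length sequence of per-screen features with temporal encoding and masking, can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation. It assumes a sinusoidal temporal encoding, a single pooling query vector, and no learned projection matrices; the summary specifies none of these. The names `sinusoidal_encoding` and `attention_pool` are hypothetical.

```python
import math


def sinusoidal_encoding(position, dim):
    """Sinusoidal position code (an assumption; the source only says 'temporal encoding')."""
    return [
        math.sin(position / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(position / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def attention_pool(screens, mask, query, num_heads=2):
    """Pool per-screen feature vectors into one vector.

    screens: list of feature vectors, one per screen (padded to a common length)
    mask:    list of bools, True for real screens and False for padding
    query:   pooling query vector (learned in a real model; fixed here)
    """
    dim = len(query)
    if dim % num_heads != 0:
        raise ValueError("feature dimension must be divisible by num_heads")
    head_dim = dim // num_heads

    # Add the temporal code so attention can take screen order into account.
    feats = [
        [x + p for x, p in zip(screen, sinusoidal_encoding(t, dim))]
        for t, screen in enumerate(screens)
    ]

    pooled = []
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        scores = []
        for feat, keep in zip(feats, mask):
            if keep:
                dot = sum(q * k for q, k in zip(query[lo:hi], feat[lo:hi]))
                scores.append(dot / math.sqrt(head_dim))
            else:
                scores.append(-math.inf)  # padded screens get zero weight
        top = max(scores)  # requires at least one unmasked screen
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted average of this head's slice of every screen feature.
        for j in range(lo, hi):
            pooled.append(sum(w * feat[j] for w, feat in zip(weights, feats)))
    return pooled
```

Because masked positions receive a score of negative infinity, their softmax weight is exactly zero, so the pooled vector does not depend on whatever values fill the padding.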