Visual Embedding of Screen Sequences for User-Flow Search in Example-driven Communication
Daeheon Jeong, Hyehyun Chu
TL;DR
The paper addresses the challenge of communicating UX considerations to diverse stakeholders by using user-flow screen sequences as example-driven evidence. It introduces a semantic embedding method that encodes a sequence of screens with a ViT-based visual encoder and a one-layer multi-head attention pooler. The model is trained with a symmetric contrastive loss that aligns sequence embeddings with textual task descriptions: $\mathcal{L}_{\mathrm{contrast}} = \tfrac{1}{2}[\mathrm{CE}(\mathbf{V}\mathbf{T}^{\mathsf{T}}/\tau) + \mathrm{CE}(\mathbf{T}\mathbf{V}^{\mathsf{T}}/\tau)]$, where $\tau=0.07$. The method is motivated by a formative study with four UX practitioners and evaluated in a human-subject survey (n=21), which indicates that sequence-based retrieval aligns better with human perceptions of relevance than text-only baselines and offers practical value as a design reference. The work contributes (1) an in-depth interview study on how practitioners communicate UX considerations, (2) a novel screen-sequence embedding method for semantic search, and (3) empirical evidence supporting integration into design workflows, with implications for computational representations of user flows.
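To make the loss above concrete, here is a minimal pure-Python sketch. It rests on assumptions the summary does not state: each row of V (a screen-sequence embedding) is paired with the same-index row of T (a task-description embedding), the embeddings are L2-normalized, and the cross-entropy target for row i is column i, as in common CLIP-style setups. The function names are illustrative and not taken from the paper.

```python
import math


def cross_entropy_diagonal(logits):
    """Mean cross-entropy over rows, assuming row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        # Numerically stable log-sum-exp of the row.
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(V, T, tau=0.07):
    """0.5 * [CE(V T^T / tau) + CE(T V^T / tau)] for paired embeddings.

    V, T: equal-length lists of vectors (assumed L2-normalized),
    where V[i] and T[i] form a matched pair.
    """
    # Similarity logits: sequence-to-text direction.
    logits_vt = [[sum(a * b for a, b in zip(v, t)) / tau for t in T] for v in V]
    # Text-to-sequence direction is the transpose.
    logits_tv = [list(col) for col in zip(*logits_vt)]
    return 0.5 * (cross_entropy_diagonal(logits_vt) + cross_entropy_diagonal(logits_tv))


# Toy batch: two perfectly aligned pairs versus two swapped pairs.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

With the small temperature $\tau=0.07$, the aligned batch yields a loss close to zero, while the swapped batch is penalized heavily, which is the behavior the contrastive objective relies on.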
Abstract
Effective communication of UX considerations to stakeholders (e.g., designers and developers) is a critical challenge for UX practitioners. To explore this problem, we interviewed four UX practitioners about their communication challenges and strategies. Our study identifies that providing an example user flow (a screen sequence representing a semantic task) as evidence reinforces communication, yet finding relevant examples remains challenging. To address this, we propose a method to systematically retrieve user flows using semantic embedding. Specifically, we design a model that learns to associate screens' visual features with user flow descriptions through contrastive learning. A survey confirms that our approach retrieves user flows better aligned with human perceptions of relevance. We analyze the results and discuss implications for the computational representation of user flows.
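As a rough illustration of how retrieval over such embeddings might work, the sketch below ranks stored user flows by cosine similarity to a query embedding. The abstract does not specify the similarity measure or the index structure, so cosine similarity, the dictionary layout, and the names `cosine` and `rank_flows` are all assumptions made for this example. The toy two-dimensional vectors stand in for real model outputs.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_flows(query_embedding, flow_embeddings, top_k=3):
    """Return the names of the top_k flows most similar to the query.

    flow_embeddings: dict mapping a (hypothetical) flow name to its embedding.
    """
    scored = [(cosine(query_embedding, emb), name)
              for name, emb in flow_embeddings.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]


# Toy index of three flows; in practice these would come from the trained encoder.
flows = {
    "checkout": [1.0, 0.0],
    "login": [0.0, 1.0],
    "signup": [0.6, 0.8],
}
top_two = rank_flows([0.0, 1.0], flows, top_k=2)
```

Here the query embedding coincides with the "login" flow, so that flow ranks first and the partially similar "signup" flow ranks second.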
