UI-JEPA: Towards Active Perception of User Intent through Onscreen User Activity

Yicheng Fu; Raviteja Anantha; Prabal Vashisht; Jianpeng Cheng; Etai Littwin

TL;DR

UI-JEPA introduces a lightweight, on-device approach for predicting user intent from on-screen actions by fusing a JEPA-based video encoder with a decoder-only LLM. The framework learns abstract UI embeddings through self-supervised temporal masking on unlabeled UI videos and achieves intent prediction with far lower compute and latency than large multimodal LLMs, while maintaining competitive accuracy. Two new benchmarks, Intent in the Wild (IIW) and Intent in the Tame (IIT), establish few-shot and zero-shot UI understanding baselines, with UI-JEPA demonstrating strong performance and clear data efficiency gains. The work highlights practical pathways for privacy-preserving, low-resource UI understanding and enables useful applications in user feedback collection and multimodal intent state tracking.

Abstract

Generating user intent from a sequence of user interface (UI) actions is a core challenge in comprehensive UI understanding. Recent advancements in multimodal large language models (MLLMs) have led to substantial progress in this area, but their large parameter counts, heavy compute requirements, and high latency make them impractical for scenarios requiring lightweight, on-device solutions with low latency or heightened privacy. Additionally, the lack of high-quality datasets has hindered the development of such lightweight models. To address these challenges, we propose UI-JEPA, a novel framework that employs masking strategies to learn abstract UI embeddings from unlabeled data through self-supervised learning, combined with an LLM decoder fine-tuned for user intent prediction. We also introduce two new UI-grounded multimodal datasets, "Intent in the Wild" (IIW) and "Intent in the Tame" (IIT), designed for few-shot and zero-shot UI understanding tasks. IIW consists of 1.7K videos across 219 intent categories, while IIT contains 914 videos across 10 categories. We establish the first baselines for these datasets, showing that representations learned using a JEPA-style objective, combined with an LLM decoder, can achieve user intent predictions that match the performance of state-of-the-art large MLLMs, but with significantly reduced annotation and deployment resources. Measured by intent similarity scores, UI-JEPA outperforms GPT-4 Turbo and Claude 3.5 Sonnet by 10.0% and 7.2% respectively, averaged across the two datasets. Notably, UI-JEPA achieves this performance with a 50.5x reduction in computational cost and a 6.6x improvement in latency on the IIW dataset. These results underscore the effectiveness of UI-JEPA, highlighting its potential for lightweight, high-performance UI understanding.
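To make the masking idea in the abstract concrete, the following is a minimal sketch of one possible temporal masking step and a JEPA-style embedding loss. It is illustrative only: the function names, the single contiguous masked block, the mask ratio, and the choice of an L1 distance are assumptions made for this example and are not taken from the paper, which only states that masking strategies and a JEPA-style objective are used.

```python
import random


def temporal_block_mask(num_frames, mask_ratio, rng):
    """Split frame indices into visible context and one contiguous masked block.

    Hypothetical helper: one contiguous temporal block is an assumption,
    not the paper's exact masking strategy.
    """
    if num_frames < 2:
        raise ValueError("need at least two frames")
    if not 0.0 < mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in (0, 1)")
    block_len = max(1, round(num_frames * mask_ratio))
    block_len = min(block_len, num_frames - 1)  # keep at least one context frame
    start = rng.randrange(0, num_frames - block_len + 1)
    target = list(range(start, start + block_len))
    masked = set(target)
    context = [t for t in range(num_frames) if t not in masked]
    return context, target


def l1_embedding_loss(predicted, target):
    """Mean absolute difference between predicted and target embeddings.

    The L1 distance is an assumed choice for this sketch.
    """
    total, count = 0.0, 0
    for p_vec, t_vec in zip(predicted, target):
        for p, t in zip(p_vec, t_vec):
            total += abs(p - t)
            count += 1
    return total / count


# Usage: mask half of an 8-frame clip. In a JEPA-style setup the context
# frames would feed the x-encoder, and the predictor would be trained to
# match the y-encoder's embeddings at the masked positions.
rng = random.Random(0)
context_idx, target_idx = temporal_block_mask(num_frames=8, mask_ratio=0.5, rng=rng)
```

The key property of this kind of objective is that the loss is computed in embedding space rather than pixel space, so the encoder is not asked to reconstruct every on-screen detail.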

Paper Structure (37 sections, 12 figures, 11 tables)

Introduction
Related Work
UI Understanding
Multimodal Large Language Models
Self-Supervised Learning
The UI-JEPA Framework
Network Parameterization
Training
UI-JEPA Data Strategy
Visualization of UI-JEPA embeddings
The UI-JEPA Benchmarks
Intent in the Wild
Intent in the Tame
Baselines
Results
...and 22 more sections

Figures (12)

Figure 1: 3D scatter plots comparing benchmark scores with model size and latency on the Intent in the Wild and Intent in the Tame datasets: (a) the relationship between model size (in billions of parameters), latency (in milliseconds), and intent similarity scores on Intent in the Wild; (b) the same relationship on Intent in the Tame. Each point represents a different model.
Figure 2: (a) Training Process of UI-JEPA: The training process consists of two stages: (1) JEPA Tuning Stage: The pre-trained x-encoder, y-encoder, and predictor are further fine-tuned on our UI datasets using various masking techniques. (2) LLM Fine-tuning Stage: The parameters of the x-encoder from the previous stage are frozen. The video embedding is combined with the text token embeddings, and both are fed together as input to the large language model to generate an output embedding. The final loss is computed based only on the text portion of the output, excluding the video portion; (b) Inference Process of UI-JEPA: During inference, the video embedding and text embeddings are input into the language model to generate a prediction of user intent.
Figure 3: 2D Visualization of Video Embeddings: The left panel shows the 2D embedding representation of videos from the "Intent in the Wild" dataset using a random encoder, while the right panel displays the embeddings generated by the UI-JEPA encoder.
Figure 4: Examples of inputs and corresponding labels from the IIW and IIT datasets. In the IIW dataset, the input is a sequence of UI actions in a single video, labeled with a high-level, delexicalized description of user intent. In contrast, the IIT dataset uses lexicalized intent as the label. In addition, it includes OCR texts converted from the final video frame.
Figure 5:
...and 7 more figures
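
The Figure 2 caption notes that, during LLM fine-tuning, the loss is computed only on the text portion of the output and not on the video portion. A minimal sketch of that idea follows, assuming a per-position negative log-likelihood and a boolean mask marking which positions are text. The function name, the mask representation, and the toy probabilities are hypothetical and chosen for illustration; the paper's actual loss implementation is not shown in this excerpt.

```python
import math


def text_only_nll(log_probs, target_ids, is_text):
    """Average negative log-likelihood over text positions only.

    log_probs:  per-position lists of log-probabilities over the vocabulary
    target_ids: per-position target token ids (placeholders at video positions)
    is_text:    per-position flags; False marks video-embedding positions
    """
    total, count = 0.0, 0
    for step_log_probs, target, keep in zip(log_probs, target_ids, is_text):
        if keep:  # video positions contribute nothing to the loss
            total -= step_log_probs[target]
            count += 1
    if count == 0:
        raise ValueError("no text positions to score")
    return total / count


# Toy sequence: two video positions followed by two text positions,
# with a vocabulary of size 2.
log_probs = [
    [math.log(0.5), math.log(0.5)],    # video position (ignored)
    [math.log(0.5), math.log(0.5)],    # video position (ignored)
    [math.log(0.25), math.log(0.75)],  # text position, target id 1
    [math.log(0.9), math.log(0.1)],    # text position, target id 0
]
target_ids = [0, 0, 1, 0]
is_text = [False, False, True, True]
loss = text_only_nll(log_probs, target_ids, is_text)
```

Masking the video positions this way means the gradient signal comes only from how well the model predicts the intent text given the video embedding as context.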
