Tell Me What's Next: Textual Foresight for Generic UI Representations

Andrea Burns, Kate Saenko, Bryan A. Plummer

TL;DR

This work introduces Textual Foresight, a generation-based pretraining objective that predicts a future UI state conditioned on the current screen and a localized action. A vision-language model built on BLIP-2 is trained to generate a global caption of the next screen, which requires it to combine local element semantics with global screen context and yields transferable UI representations. The authors also curate OpenApp, a public dataset with element-level captions, screen-level captions, and Textual Foresight triplets, which enables open benchmarking on four UI tasks: screen summarization, element captioning, tappability, and grounding. On generation tasks, Textual Foresight outperforms the state of the art by 2% with 28x fewer images; against the new OpenApp baselines, it improves average task performance by 5.7% with 2x less data.

Abstract

Mobile app user interfaces (UIs) are rich with action, text, structure, and image content that can be utilized to learn generic UI representations for tasks like automating user commands, summarizing content, and evaluating the accessibility of user interfaces. Prior work has learned strong visual representations with local or global captioning losses, but fails to retain both granularities. To combat this, we propose Textual Foresight, a novel pretraining objective for learning UI screen representations. Textual Foresight generates global text descriptions of future UI states given a current UI and local action taken. Our approach requires joint reasoning over elements and entire screens, resulting in improved UI features: on generation tasks, UI agents trained with Textual Foresight outperform state-of-the-art by 2% with 28x fewer images. We train with our newly constructed mobile app dataset, OpenApp, which results in the first public dataset for app UI representation learning. OpenApp enables new baselines, and we find Textual Foresight improves average task performance over them by 5.7% while having access to 2x less data.
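As a rough illustration of the objective described above, the sketch below turns one recorded action sequence into (current screen, action question, next-screen caption) training samples. It is a minimal sketch under stated assumptions, not the authors' pipeline: the `Step` and `ForesightSample` classes, the bounding-box format, and the wording of the action question are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max), normalized to [0, 1].
BBox = Tuple[float, float, float, float]


@dataclass
class Step:
    """One state in a recorded app trace (hypothetical schema)."""
    screenshot: str              # path to the rendered screen image
    caption: str                 # global text description of this screen
    action_bbox: Optional[BBox]  # element acted on here; None for the last state


@dataclass
class ForesightSample:
    """One pretraining sample: image and question in, next-screen caption out."""
    image: str
    question: str
    target: str


def action_question(bbox: BBox) -> str:
    # The exact prompt wording is an assumption made for illustration.
    coords = ", ".join(f"{v:.2f}" for v in bbox)
    return (
        "What would we expect to see on the next screen "
        f"if we interact with the element at [{coords}]?"
    )


def build_samples(trace: List[Step]) -> List[ForesightSample]:
    """Pair each state with the caption of the state that follows it."""
    samples = []
    for current, following in zip(trace, trace[1:]):
        if current.action_bbox is None:
            continue  # no recorded action, so there is nothing to condition on
        samples.append(
            ForesightSample(
                image=current.screenshot,
                question=action_question(current.action_bbox),
                target=following.caption,
            )
        )
    return samples
```

The target is the global caption of the following screen while the input carries only a local element location, so a model trained on such samples has to relate element-level and screen-level information.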

Paper Structure (35 sections, 7 equations, 6 figures, 9 tables)

Figures (6)

  • Figure 1: Textual Foresight vs. Element Captioning. While both Element Captioning and Textual Foresight pretraining aim to preserve the semantics of individual UI objects, Textual Foresight also requires understanding global UI semantics of the current screen and how an action on the UI will change it, as the objective is to generate the global description of the following screen. We highlight in red the UI object associated with the input bounding box coordinates.
  • Figure 2: Prior Work Comparison. We divide pretraining objectives by loss type (prediction vs. generation) and by use of interaction (objectives that include UI actions vs. those that only concern static UIs in isolation). We bold Textual Foresight and Element Captioning as they only use the rendered screen to represent the UI.
  • Figure 3: Textual Foresight. We illustrate how app states from action sequences are used in (current screen, current action, next screen) triplets to pretrain a vision-language model for UI representations. We only use the app screen to represent the UI, and additionally feed in an action question that asks what we would expect to see at the next state if we interact with a particular UI element. Our model decodes a text description of the following screen, using the action to bridge local element and global screen features. More Textual Foresight examples are in the appendix.
  • Figure 4: UI Downstream Task Examples. We illustrate samples from the Screen2Words screen summarization benchmark [screen2words], the Widget Caption element captioning task [widgetcap], the Tappability classification task [tappability], and, lastly, the MUG language grounding benchmark [li2022mug].
  • Figure 5: Examples from the OpenApp dataset. We show examples of the new captions we build for OpenApp from the element captioning, element list captioning, screen captioning, and Textual Foresight sample sets.
  • ...and 1 more figure