Mapping Natural Language Commands to Web Elements
Panupong Pasupat, Tian-Shun Jiang, Evan Zheran Liu, Kelvin Guu, Percy Liang
TL;DR
We address the task of grounding natural language commands on open-ended web pages: given a command, select the corresponding DOM element. We release a dataset of 51,663 commands over 1,835 pages and propose three baselines (retrieval, embedding-based, and alignment-based) that draw on text, attribute, and spatial features. The embedding and alignment approaches substantially outperform the retrieval baseline, with text content emerging as the most informative signal and spatial context offering more nuanced gains. This work advances natural language interfaces for web navigation and automation, and the released code and data are intended to spur further improvements in NL-to-UI grounding.
Abstract
The web provides a rich, open-domain environment with textual, structural, and spatial properties. We propose a new task for grounding language in this environment: given a natural language command (e.g., "click on the second article"), choose the correct element on the web page (e.g., a hyperlink or text box). We collected a dataset of over 50,000 commands that capture various phenomena such as functional references (e.g., "find who made this site"), relational reasoning (e.g., "article by john"), and visual reasoning (e.g., "top-most article"). We also implemented and analyzed three baseline models that capture different phenomena present in the dataset.
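To make the task format concrete, here is a minimal sketch of command-to-element selection using plain token overlap between the command and each element's visible text. It is an illustration only, not the authors' retrieval, embedding, or alignment model. The `Element` class, the `tokenize`, `overlap_score`, and `select_element` helpers, and the example page are all hypothetical names introduced for this sketch.

```python
import re
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a DOM element: a tag name and its visible text."""
    tag: str
    text: str


def tokenize(s: str) -> set[str]:
    """Lowercase the string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def overlap_score(command: str, element: Element) -> float:
    """Jaccard similarity between command tokens and element text tokens."""
    c, e = tokenize(command), tokenize(element.text)
    if not c or not e:
        return 0.0
    return len(c & e) / len(c | e)


def select_element(command: str, elements: list[Element]) -> Element:
    """Return the candidate element with the highest overlap score."""
    return max(elements, key=lambda el: overlap_score(command, el))


# Toy page with three candidate elements (invented for this example).
page = [
    Element("a", "Home"),
    Element("a", "Contact us"),
    Element("input", "Search articles"),
]
chosen = select_element("click on contact us", page)
print(chosen.tag, chosen.text)
```

Purely lexical matching of this kind has nothing to go on for commands such as "find who made this site" or "top-most article", where the target shares no words with the command. Those are the functional and visual phenomena the dataset is designed to capture.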
