SWITCH: Benchmarking Modeling and Handling of Tangible Interfaces in Long-horizon Embodied Scenarios

Jieru Lin; Zhiwei Yu; Börje F. Karlsson

TL;DR

SWITCH addresses a critical gap in embodied AI by evaluating how models perceive, reason about, and act on tangible control interfaces (TCIs) in real-world settings. It introduces SWITCH-Basic, a data-rich, task-driven benchmark spanning five capabilities (Task-Aware Visual Question Answering, Semantic UI Grounding, Action Generation, State Transition Prediction, and Result Verification) across 351 tasks and 98 devices using egocentric RGB video, with open data, code, and held-out splits to enable reproducible research. Evaluations of leading LMMMs reveal inconsistent and often text-reliant performance, highlighting gaps in grounding, environment-state reasoning, and outcome verification. Probing world modeling with a video-generation model (Veo3) further uncovers fundamental limitations, including physical plausibility violations and UI misinterpretations, underscoring the need for more grounded, diagnostically rich benchmarks. By releasing SWITCH as an open, evolving platform, the authors aim to accelerate progress toward robust, real-world interactive AI that can perceive interfaces, act on them, and verify the outcomes within complex environments.

Abstract

Autonomous intelligence requires not only perception and reasoning but also, critically, effective interaction with the existing world and its infrastructure. Everyday environments are rich in tangible control interfaces (TCIs), e.g., light switches, appliance panels, and embedded GUIs, that demand not only commonsense and physics reasoning but also causal prediction and outcome verification in time and space (e.g., delayed heating, remote lights). Failures here have potential safety implications, yet current benchmarks rarely test grounding, partial observability (video), or post-hoc verification in situated settings. We introduce SWITCH (Semantic World Interface Tasks for Control and Handling), an embodied, task-driven benchmark created through iterative releases to probe these gaps. Its first iteration, SWITCH-Basic, evaluates five complementary abilities: task-aware VQA, semantic UI grounding, action generation, state-transition prediction, and result verification, under egocentric RGB video input and device diversity. Across 351 tasks spanning 98 real devices and appliances, commercial and open LMMMs exhibit inconsistent performance even on single-step interactions, often over-relying on textual cues and under-using visual or video evidence (and high aggregate scores can mask such failures). SWITCH provides data, code, and held-out splits to enable reproducible evaluation and community contributions toward more challenging future iterations of the benchmark and the creation of training datasets. Benchmark resources are available at: https://github.com/BAAI-Agents/SWITCH.
