SWITCH: Benchmarking Modeling and Handling of Tangible Interfaces in Long-horizon Embodied Scenarios

Jieru Lin, Zhiwei Yu, Börje F. Karlsson

TL;DR

SWITCH addresses a critical gap in embodied AI by evaluating how models perceive, reason about, and act within tangible control interfaces (TCIs) in real-world settings. It introduces SWITCH-Basic, a data-rich, task-driven benchmark spanning five capabilities (Task-Aware Visual Question Answering, Semantic UI Grounding, Action Generation, State Transition Prediction, and Result Verification) across 351 tasks and 98 devices using egocentric RGB video, with open data, code, and held-out splits to enable reproducible research. Evaluations of leading LMMs reveal inconsistent and often text-reliant performance, highlighting gaps in grounding, environment-state reasoning, and outcome verification. Probing world modeling with a video-generation model (Veo3) further uncovers fundamental limitations, including physical plausibility violations and UI misinterpretations, underscoring the need for more grounded, diagnostically rich benchmarks. By releasing SWITCH as an open, evolving platform, the authors aim to accelerate progress toward robust, real-world interactive AI that can perceive, manipulate, and verify actions within complex environments.

Abstract

Autonomous intelligence requires not only perception and reasoning but also, critically, effective interaction with the existing world and its infrastructure. Everyday environments are rich in tangible control interfaces (TCIs), e.g., light switches, appliance panels, and embedded GUIs, that demand not only commonsense and physics reasoning but also causal prediction and outcome verification in time and space (e.g., delayed heating, remote lights). Moreover, failures here have potential safety implications, yet current benchmarks rarely test grounding, partial observability (video), or post-hoc verification in situated settings. We introduce SWITCH (Semantic World Interface Tasks for Control and Handling), an embodied, task-driven benchmark created through iterative releases to probe these gaps. Its first iteration, SWITCH-Basic, evaluates five complementary abilities: task-aware VQA, semantic UI grounding, action generation, state-transition prediction, and result verification, under egocentric RGB video input and device diversity. Across 351 tasks spanning 98 real devices and appliances, commercial and open LMMs exhibit inconsistent performance even on single-step interactions, often over-relying on textual cues and under-using visual or video evidence (and high aggregate scores can mask such failures). SWITCH provides data, code, and held-out splits to enable reproducible evaluation and community contributions toward more challenging future iterations of the benchmark and the creation of training datasets. Benchmark resources are available at: https://github.com/BAAI-Agents/SWITCH.

Paper Structure

This paper contains 16 sections, 14 figures, 5 tables.

Figures (14)

  • Figure 1: An overview of the SWITCH benchmark, using the case "Turn off all the lights" as a running example. SWITCH covers the collection and annotation of real-world TCI interaction data ("Collected Data"), which we systematically structure into five distinct tasks. These tasks are designed to evaluate models across three crucial capability dimensions: Perception/Spatial Reasoning, Causal Reasoning/Planning, and Verification. Furthermore, we leverage the benchmark to evaluate advanced generative models, like Veo3. By comparing generated videos against ground truth, we illustrate how current models still exhibit significant room for improvement in logical consistency and fine-grained interaction for real-world use, thus underscoring the importance of SWITCH's target scenarios.
  • Figure 2: Example tasks modeled in SWITCH. Device Instruction (Goal) - Interface Understanding & Action - Effect / Verification.
  • Figure 3: Example of a one-step task. The agent interprets the user's request, identifies the related TCI elements (timer, power), and generates the corresponding actions (only a timer change is needed). The state transition occurs as the switch settings change. Since the agent is close to the device during operation, result verification planning involves moving back to capture a full view of the microwave and then waiting, so that the result can be verified reliably.
  • Figure 4: Example of a multi-step document printing task demonstrating complex interaction reasoning. Given a user instruction, the agent must perform sequential Semantic UI Comprehension to interpret changing interface layouts and execute the corresponding Action Generation steps (A1–A4). Throughout the process, the model observes state transitions as the UI updates after each interaction, requiring adaptive planning based on visual context. Finally, through Result Verification Planning, the agent repositions to check the printed output, completing the full perception–action–verification loop in a dynamic, real-world setting.
  • Figure 5: Examples of two action categories. Left: UI Action. Right: Procedural Action.
  • ...and 9 more figures