Revisiting put-that-there, context aware window interactions via LLMs
Riccardo Bovo, Daniele Giunchi, Pasquale Cascarano, Eric J. Gonzalez, Mar Gonzalez-Franco
TL;DR
The paper tackles the cognitive burden of window management in panoramic XR workspaces by integrating Large Language Models with multimodal sensing (speech, pointing, head gaze) and scene semantics. It introduces a task-centric, goal-driven approach that maps high-level user intents to coordinated window actions across multiple applications, guided by surface semantics and visibility cues. Key contributions include a hybrid semantic scene understanding pipeline (automatic Quest labels plus manual augmentation), flat-surface anchoring for ergonomic window placement, and a WindowMirror-based workspace that enables LLM-guided, one-to-many action generation. This approach promises reduced cognitive load and more seamless, coherent XR productivity workflows, with future work focusing on empirical user studies to quantify benefits and refine interaction models.
Abstract
We revisit Bolt's classic "Put-That-There" concept for modern head-mounted displays by pairing Large Language Models (LLMs) with the XR sensor and technology stack. The agent fuses (i) a semantically segmented 3-D environment, (ii) live application metadata, and (iii) users' verbal, pointing, and head-gaze cues to issue JSON window-placement actions. As a result, users can manage a panoramic workspace through (1) explicit commands ("Place Google Maps on the coffee table"), (2) deictic speech plus gestures ("Put that there"), or (3) high-level goals ("I need to send a message"). Unlike traditional explicit interfaces, our system supports one-to-many action mappings and goal-centric reasoning, allowing the LLM to dynamically infer the relevant applications and layout decisions, including interrelationships across tools. This enables seamless, intent-driven interaction without manual window juggling in immersive XR environments.
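To make the idea of JSON window-placement actions and one-to-many mappings concrete, the following is a minimal sketch of how a reply to a high-level goal such as "I need to send a message" might be parsed and checked against the scene. The abstract does not specify the schema, so every name here is an assumption made for illustration: the `actions`, `action`, `app`, and `surface` fields, the `WindowAction` type, the `parse_actions` helper, and the app and surface inventories are all hypothetical.

```python
from __future__ import annotations

import json
from dataclasses import dataclass

# Hypothetical inventories. In a real system these would come from live
# application metadata and the semantically labeled scene.
KNOWN_APPS = {"Google Maps", "Messages", "Contacts"}
KNOWN_SURFACES = {"coffee table", "desk", "wall"}


@dataclass(frozen=True)
class WindowAction:
    """One window-placement action (assumed schema, not the paper's)."""
    action: str
    app: str
    surface: str


def parse_actions(llm_reply: str) -> list[WindowAction]:
    """Parse a JSON reply into validated actions.

    A single user goal may yield several actions (one-to-many), so the
    reply is expected to hold a list under the "actions" key.
    """
    payload = json.loads(llm_reply)
    actions = []
    for item in payload["actions"]:
        act = WindowAction(item["action"], item["app"], item["surface"])
        # Reject anything the scene or app list cannot satisfy, so a
        # hallucinated target never reaches the window manager.
        if act.app not in KNOWN_APPS:
            raise ValueError(f"unknown app: {act.app}")
        if act.surface not in KNOWN_SURFACES:
            raise ValueError(f"unknown surface: {act.surface}")
        actions.append(act)
    return actions


# Illustrative reply for the goal "I need to send a message":
# one intent, two coordinated window placements.
EXAMPLE_REPLY = """
{"actions": [
  {"action": "place", "app": "Messages", "surface": "desk"},
  {"action": "place", "app": "Contacts", "surface": "wall"}
]}
"""

if __name__ == "__main__":
    for act in parse_actions(EXAMPLE_REPLY):
        print(act)
```

Validating each action against the known apps and surfaces before execution is a design choice of this sketch, not something the abstract describes; it simply keeps the model's output constrained to targets that exist in the current workspace.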
