Revisiting put-that-there, context aware window interactions via LLMs

Riccardo Bovo, Daniele Giunchi, Pasquale Cascarano, Eric J. Gonzalez, Mar Gonzalez-Franco

TL;DR

The paper tackles the cognitive burden of window management in panoramic XR workspaces by integrating Large Language Models with multimodal sensing (speech, pointing, head gaze) and scene semantics. It introduces a task-centric, goal-driven approach that maps high-level user intents to coordinated window actions across multiple applications, guided by surface semantics and visibility cues. Key contributions include a hybrid semantic scene understanding pipeline (automatic Quest labels plus manual augmentation), flat-surface anchoring for ergonomic window placement, and a WindowMirror-based workspace that enables LLM-guided, one-to-many action generation. This approach promises reduced cognitive load and more seamless, coherent XR productivity workflows, with future work focusing on empirical user studies to quantify benefits and refine interaction models.

Abstract

We revisit Bolt's classic "Put-That-There" concept for modern head-mounted displays by pairing Large Language Models (LLMs) with the XR sensor and technology stack. The agent fuses (i) a semantically segmented 3-D environment, (ii) live application metadata, and (iii) users' verbal, pointing, and head-gaze cues to issue JSON window-placement actions. As a result, users can manage a panoramic workspace through (1) explicit commands ("Place Google Maps on the coffee table"), (2) deictic speech plus gestures ("Put that there"), or (3) high-level goals ("I need to send a message"). Unlike traditional explicit interfaces, our system supports one-to-many action mappings and goal-centric reasoning, allowing the LLM to dynamically infer relevant applications and layout decisions, including interrelationships across tools. This enables seamless, intent-driven interaction without manual window juggling in immersive XR environments.
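The abstract says the agent issues JSON window-placement actions and that one request may map to several actions, but the summary does not give the schema. The sketch below is therefore only an illustration: the field names `window` and `surface`, the `WindowAction` type, and the `parse_actions` helper are all assumptions, not the authors' format. It shows how a reply containing several actions could be parsed and checked against the windows and surfaces the system currently knows about.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowAction:
    """One placement action (hypothetical schema, not taken from the paper)."""
    window: str   # application window to move, e.g. "Chat"
    surface: str  # semantic label of the target surface, e.g. "table"


def parse_actions(reply: str, windows: set[str], surfaces: set[str]) -> list[WindowAction]:
    """Parse an LLM reply, assumed to be a JSON array, into validated actions."""
    raw = json.loads(reply)
    if not isinstance(raw, list):
        raise ValueError("expected a JSON array of actions")

    actions = []
    for item in raw:
        action = WindowAction(window=item["window"], surface=item["surface"])
        # Reject references to windows or surfaces that do not exist in the scene.
        if action.window not in windows:
            raise ValueError(f"unknown window: {action.window}")
        if action.surface not in surfaces:
            raise ValueError(f"unknown surface: {action.surface}")
        actions.append(action)
    return actions
```

A single reply such as `[{"window": "Chat", "surface": "table"}, {"window": "Google Maps", "surface": "wall"}]` would yield two actions, which is one way to read the one-to-many mapping described above.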

Paper Structure

This paper contains 15 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: System architecture overview. The LLM receives input from three main modules: (1) Scene Understanding, which provides semantic segmentation and identifies flat surfaces; (2) Window Workspace, which manages digital window elements using the WindowMirror system; and (3) User Behaviour, capturing head direction, pointing, and voice commands. Together, these components allow the LLM to interpret multimodal requests and generate actionable window placement decisions.
  • Figure 2: (A) Semantic segmentation of the scene using the Meta Quest API, showing identified classes such as floor, cabinet, and table. (B) Flat surface detection for placing virtual windows, with mesh overlays illustrating usable planar regions.
  • Figure 3: Task-centric window placement. Instead of naming specific applications or surfaces, users simply state their goals (e.g., “I need to send a message,” “I need some location’s information,” “I need to finish coding my application”). The LLM interprets each high-level task and (1) selects the relevant window (Chat, Google Maps, or Visual Studio) and (2) places it on an appropriate, visible surface. This allows users to think in terms of what they want to accomplish rather than how to manage windows, streamlining workflow in XR.
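
The two steps in the Figure 3 caption (select the relevant window, then place it on a visible surface) can be sketched as follows. In the paper the selection is done by an LLM; here a keyword table stands in for it purely so the example runs on its own. The `Surface` type, the `GOAL_TO_APP` table, and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Surface:
    label: str        # semantic class from scene understanding, e.g. "table"
    is_flat: bool     # usable planar region for anchoring a window
    is_visible: bool  # assumed to mean: currently within the user's view


# Hypothetical keyword table standing in for the LLM's goal-to-application reasoning.
GOAL_TO_APP = {
    "message": "Chat",
    "location": "Google Maps",
    "coding": "Visual Studio",
}


def choose_app(goal):
    """Step 1: pick the application that matches the stated goal, if any."""
    lowered = goal.lower()
    for keyword, app in GOAL_TO_APP.items():
        if keyword in lowered:
            return app
    return None


def choose_surface(surfaces):
    """Step 2: pick the first surface that is both flat and visible."""
    for surface in surfaces:
        if surface.is_flat and surface.is_visible:
            return surface
    return None


def place_for_goal(goal, surfaces):
    """Combine both steps into one placement decision, or None if either fails."""
    app = choose_app(goal)
    surface = choose_surface(surfaces)
    if app is None or surface is None:
        return None
    return {"window": app, "surface": surface.label}


scene = [Surface("floor", True, False), Surface("table", True, True)]
print(place_for_goal("I need to send a message", scene))
# prints {'window': 'Chat', 'surface': 'table'}
```

Taking the first flat, visible surface is the simplest possible policy; the system described here would also weigh surface semantics, which this sketch leaves out.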