ClickAIXR: On-Device Multimodal Vision-Language Interaction with Real-World Objects in Extended Reality

Dawar Khan, Alexandre Kouyoumdjian, Xinyu Liu, Omar Mena, Dominik Engel, Ivan Viola

Abstract

We present ClickAIXR, a novel on-device framework for multimodal vision-language interaction with objects in extended reality (XR). Unlike prior systems that rely on cloud-based AI (e.g., ChatGPT) or gaze-based selection (e.g., GazePointAR), ClickAIXR integrates an on-device vision-language model (VLM) with a controller-based object selection paradigm, enabling users to precisely click on real-world objects in XR. Once selected, the object image is processed locally by the VLM to answer natural language questions through both text and speech. This object-centered interaction reduces ambiguity inherent in gaze- or voice-only interfaces and improves transparency by performing all inference on-device, addressing concerns around privacy and latency. We implemented ClickAIXR in the Magic Leap SDK (C API) with ONNX-based local VLM inference. We conducted a user study comparing ClickAIXR with Gemini 2.5 Flash and ChatGPT 5, evaluating usability, trust, and user satisfaction. Results show that latency is moderate and user experience is acceptable. Our findings demonstrate the potential of click-based object selection combined with on-device AI to advance trustworthy, privacy-preserving XR interactions. The source code and supplementary materials are available at: nanovis.org/ClickAIXR.html
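
The pipeline summarized in the abstract (crop the clicked object, transcribe the spoken question, run ONNX-based local VLM inference, return the answer as text and speech) can be sketched in a few lines. The sketch below uses Python with ONNX Runtime rather than the Magic Leap C API used in the actual implementation; the model file names, tensor names, and end-of-sequence id are placeholders, not the real ClickAIXR assets.

    import numpy as np
    import onnxruntime as ort

    # Hypothetical ONNX files for a small encoder–decoder VLM; the actual
    # ClickAIXR model layout and tensor names may differ.
    encoder = ort.InferenceSession("vision_encoder.onnx")
    decoder = ort.InferenceSession("text_decoder.onnx")

    def answer_question(roi_rgb: np.ndarray, question_ids: list[int],
                        eos_id: int = 2, max_new_tokens: int = 64) -> list[int]:
        """Fuse the cropped object image with the ASR-transcribed question
        and decode an answer token-by-token, entirely on-device."""
        # 1) Encode the cropped ROI: HWC uint8 -> NCHW float32 in [0, 1].
        pixels = roi_rgb.astype(np.float32).transpose(2, 0, 1)[None] / 255.0
        (image_features,) = encoder.run(None, {"pixel_values": pixels})

        # 2) Greedy autoregressive decoding conditioned on image features + question.
        ids = list(question_ids)
        for _ in range(max_new_tokens):
            (logits,) = decoder.run(None, {
                "input_ids": np.array([ids], dtype=np.int64),
                "encoder_hidden_states": image_features,
            })
            next_id = int(logits[0, -1].argmax())
            if next_id == eos_id:
                break
            ids.append(next_id)
        return ids[len(question_ids):]  # answer tokens only; detokenize + TTS afterwards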

Paper Structure

This paper contains 17 sections, 3 equations, 7 figures, and 3 tables.

Figures (7)

  • Figure 2: Overview of the ClickAIXR pipeline. Users choose between (i) dwell mode, where a fixed-size gaze-locked clipping window (GCW) follows gaze and, after a brief dwell, auto-captures an ROI for image captioning or a spoken/text query; or (ii) GCW select-and-ask, where the user places the border-only GCW on the target, adjusts its width/height/depth with the controller, and confirms with the controller trigger. After confirmation, a microphone icon appears; the spoken question is converted to text via on-device ASR, then fused with the cropped image and processed by the on-device VLM (encoder–decoder, tokenizer). The answer is returned to the XR UI as text and to the user as audio via TTS on the ML2.
  • Figure 3: Examples of images used for the latency test: top row from COCO [LinCoco2014], bottom row from the Book Covers dataset [iwana2016judging].
  • Figure 4: User study overview and in-headset views on Magic Leap 2. Top: participants interacting with the object table using ClickAIXR. Bottom: representative object layouts and direct in-headset views; the border-only rectangle is the gaze-locked clipping window (GCW), which participants are positioning and resizing before capture.
  • Figure 5: Some of the objects we placed in the room used for the study. Their close proximity to one another often leads to overlap once photographed from a particular angle, which requires disambiguation.
  • Figure 6: Mean scores (0--100) for ChatGPT, Gemini, and ClickAIXR. Bars show mean values; error bars indicate $\pm$95% confidence intervals (CI). Results: Gemini = $81.9 \pm 11.2$ (SD), CI $\pm 6.36$; ChatGPT = $76.7 \pm 15.8$ (SD), CI $\pm 8.93$; ClickAIXR = $60.0 \pm 17.1$ (SD), CI $\pm 9.65$. (A note on how these CI half-widths relate to the SDs follows this list.)
  • ...and 2 more figures
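
If the confidence intervals in the Figure 6 caption are computed with the usual normal approximation (the number of participants $n$ is not stated in this excerpt), the reported half-widths follow from the sample standard deviation $s$ as:

    % 95% CI half-width under the normal approximation, where s is the
    % sample standard deviation and n the number of participants:
    \[
      \mathrm{CI}_{95\%} \;=\; z_{0.975}\,\frac{s}{\sqrt{n}} \;\approx\; 1.96\,\frac{s}{\sqrt{n}}
    \]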