Sensible Agent: A Framework for Unobtrusive Interaction with Proactive AR Agents

Geonsun Lee, Min Xia, Nels Numan, Xun Qian, David Li, Yanhe Chen, Achin Kulshrestha, Ishan Chatterjee, Yinda Zhang, Dinesh Manocha, David Kim, Ruofei Du

TL;DR

Sensible Agent tackles the problem of disruptive interaction with proactive AR agents by jointly optimizing what the agent should offer and how it should be delivered, guided by real-time multimodal context. The framework comprises an Action Recommendation module (What) and an Interaction Adaption module (How), both conditioned on context and social factors, and realized in a WebXR prototype with LLM-based reasoning and multiple input modalities. Across studies, Sensible Agent reduces perceived interaction effort and remains highly usable, while enabling user preferences to shape modality choice, demonstrating practical benefits for unobtrusive, context-aware AR assistance. The work lays groundwork for scalable, socially aware proactive AR systems and points to future extensions in cross-device orchestration and longitudinal personalization within ambient computing environments.

Abstract

Proactive AR agents promise context-aware assistance, but their interactions often rely on explicit voice prompts or responses, which can be disruptive or socially awkward. We introduce Sensible Agent, a framework designed for unobtrusive interaction with these proactive agents. Sensible Agent dynamically adapts both "what" assistance to offer and, crucially, "how" to deliver it, based on real-time multimodal context sensing. Informed by an expert workshop (n=12) and a data annotation study (n=40), the framework leverages egocentric cameras, multimodal sensing, and Large Multimodal Models (LMMs) to infer context and suggest appropriate actions delivered via minimally intrusive interaction modes. We demonstrate our prototype on an XR headset through a user study (n=10) in both AR and VR scenarios. Results indicate that Sensible Agent significantly reduces perceived interaction effort compared to a voice-prompted baseline, while maintaining high usability and achieving higher preference.

Paper Structure

This paper contains 70 sections, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Detailed dataflow of the Sensible Agent framework. An ACTION RECOMMENDATION MODULE (WHAT) takes user context and determines the suggested action in one of three primary formats, and an INTERACTION ADAPTION MODULE (HOW) selects presentation modality and input modalities.
  • Figure 2: System architecture of our proactive AR agent prototype. The full system is implemented in WebXR with support for real-time interaction in 360° videos or video see-through AR environments. (1) The system processes visual and audio input and parses contextual attributes such as familiarity, urgency, and environmental noise using a VLM and YAMNet. (2) Based on the parsed context, the proactive query generator formulates a suitable suggestion, including its agent action, presentation modality, and query type. These are passed to the interaction module, where (3) the UI manager renders the query and (4) the input modality manager enables one or more input modalities (e.g., gaze, hand, head, voice) based on feasibility and appropriateness. The interaction module then forwards the option selected by the user to (5) the response generator.
  • Figure 3: Web interface for the data annotation study. Each participant annotated 24 scenarios through a 3-step workflow. In Step 1, participants viewed: (1a) a short text describing the scenario, (1b) a synthetic egocentric image for visual consistency, (1c) contextual details (e.g., location, engagement), and (1d) a text input field to describe the desired proactive AR agent action. In Step 2, the input was converted into (2a) multi-choice, (2b) binary, and (2c) icon-style queries using LLMs. Participants could edit or choose their preferred query type. In Step 3, they (3a) rated the usefulness of the action and (3b) selected the preferred presentation modality (audio, visual, or both). Final responses were exported as a CSV after completing all 24 scenarios.
  • Figure 4: Distribution of data entries in selected presentation modes (left) and query types (right) across different high-level activities and context variants. Data showed varying preferences for modality (audio, visual, audiovisual) and query format (binary, multiple-choice, icon-based) depending on situational demands and activity type.
  • Figure 5: Applications. A-D) Sensible Agent's initial query (I) and repetitive query (R), based on the same daily scenarios. A) Gym visit-I. B) Gym visit-R. C) Restaurant order-I. D) Restaurant order-R. E) Novel feature suggestion: virtual try-on. F) Subtle cues: information retrieval. G) Effortless smart device control. H) Future application: Human-robot interaction.
  • ...and 4 more figures
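
The split between the "what" and "how" modules described in the Figure 1 and Figure 2 captions can be illustrated with a minimal Python sketch. This is not the authors' implementation: the prototype is built in WebXR and relies on LLM/VLM reasoning, whereas the hand-written rules, thresholds, field names (`noise_db`, `hands_busy`, `in_public`), and function names below are all hypothetical placeholders. Only the vocabulary is taken from the captions: the query types (binary, multi-choice, icon), the presentation modalities (audio, visual, both), and the input modalities (gaze, hand, head, voice).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Context:
    """Parsed context. Familiarity, urgency, and noise are named in Figure 2;
    the remaining fields and all value encodings are assumptions."""
    familiarity: str   # "familiar" or "unfamiliar" (assumed encoding)
    urgency: str       # "low" or "high" (assumed encoding)
    noise_db: float    # assumed ambient-noise estimate
    hands_busy: bool   # assumed feasibility signal
    in_public: bool    # assumed social factor


@dataclass
class Suggestion:
    action: str
    query_type: str              # "binary", "multi-choice", or "icon"
    presentation: str            # "audio", "visual", or "both"
    input_modalities: List[str]  # subset of gaze, hand, head, voice


NOISE_THRESHOLD_DB = 70.0  # hypothetical cutoff, not from the paper


def recommend_action(ctx: Context) -> Tuple[str, str]:
    """'What' module stand-in: choose an action label and a query type."""
    if ctx.urgency == "high":
        return "confirm_next_step", "binary"
    if ctx.familiarity == "familiar":
        return "repeat_usual_choice", "icon"
    return "offer_options", "multi-choice"


def adapt_interaction(ctx: Context) -> Tuple[str, List[str]]:
    """'How' module stand-in: choose presentation and enabled inputs."""
    noisy = ctx.noise_db > NOISE_THRESHOLD_DB
    presentation = "visual" if (noisy or ctx.in_public) else "both"
    inputs = ["head", "gaze"]
    if not ctx.hands_busy:
        inputs.append("hand")
    if not noisy and not ctx.in_public:
        inputs.append("voice")
    return presentation, inputs


def suggest(ctx: Context) -> Suggestion:
    """Combine both modules into one suggestion for the UI layer."""
    action, query_type = recommend_action(ctx)
    presentation, inputs = adapt_interaction(ctx)
    return Suggestion(action, query_type, presentation, inputs)


if __name__ == "__main__":
    gym = Context("familiar", "low", noise_db=80.0, hands_busy=True, in_public=True)
    print(suggest(gym))
```

Under these placeholder rules, a noisy public setting with occupied hands yields a visual, icon-style query answered by head or gaze input, which matches the kind of unobtrusive delivery the framework aims for. In the actual system, both decisions are made by model-based reasoning rather than fixed thresholds.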