Object-Driven Narrative in AR: A Scenario-Metaphor Framework with VLM Integration
Yusi Sun, Haoyan Guan, Leith Kin Yep Chan, Yong Hong Kuo
TL;DR
This work targets the insufficiency of label-based AR storytelling by introducing a scenario-metaphor framework that unites VLMs with spatial AR through state-aware object semantics, a structured JSON narrative interface, and the STAM evaluation framework. The authors implement an end-to-end pipeline where VLM-generated metaphors are grounded in AR anchors, enabling environment-derived narratives rather than mere overlays. Across three validation modules (foundational capability, cognitive alignment, and system integration), the approach improves environmental reinterpretation and spatial engagement, while exposing tensions between AI creativity and deterministic AR grounding, particularly in 3D localization and narrative coherence over time. The study points toward a paradigm shift from preset spatial scripts to narrative emergence driven by environmental intelligence, with future work focusing on unified embedding spaces, cross-modal content generation, and cultural adaptation of metaphors.
Abstract
Most adaptive AR storytelling systems define environmental semantics using simple object labels and spatial coordinates, which limits narratives to rigid, pre-defined logic. This oversimplification overlooks the contextual significance of object relationships. For example, a wedding ring on a nightstand might suggest marital conflict, yet such systems treat it as just "two objects" in space. To address this, we explored integrating Vision Language Models (VLMs) into AR pipelines, and three challenges emerged. First, stories generated with simple prompt guidance lacked narrative depth and made little use of the surrounding space. Second, spatial semantics were underutilized and failed to support meaningful storytelling. Third, pre-generated scripts struggled to align with AR Foundation's object naming and coordinate systems. We propose a scene-driven AR storytelling framework that reimagines environments as active narrative agents, built on three innovations: (1) State-aware object semantics: we decompose object meaning into physical, functional, and metaphorical layers, allowing VLMs to distinguish subtle narrative cues between similar objects. (2) Structured narrative interface: a bidirectional JSON layer maps VLM-generated metaphors to AR anchors, maintaining spatial and semantic coherence. (3) STAM evaluation framework: a three-part experimental design evaluates narrative quality, highlighting both strengths and limitations of VLM-AR integration. Our findings show that the system can generate stories from the environment itself rather than merely placing them on top of it. In user studies, 70% of participants reported seeing real-world objects differently when narratives were grounded in environmental symbolism. By merging VLMs' generative creativity with AR's spatial precision, this framework introduces a novel object-driven storytelling paradigm that transforms passive spaces into active narrative landscapes.
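To make the first two innovations more concrete, the following is a minimal Python sketch of how a three-layer object description might be exchanged as JSON and checked against known AR anchors. It is an illustration under stated assumptions, not the authors' implementation: the field names (`anchor_id`, `label`, `physical`, `functional`, `metaphorical`), the `objects` wrapper key, and both function names are hypothetical, and the real system targets AR Foundation rather than Python.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical schema: one identifier plus the three semantic layers
# named in the abstract (physical, functional, metaphorical).
LAYER_FIELDS = ("anchor_id", "label", "physical", "functional", "metaphorical")


@dataclass
class ObjectSemantics:
    anchor_id: str      # assumed name of the AR anchor this object is bound to
    label: str          # plain detector label, e.g. "ring"
    physical: str       # observable state of the object
    functional: str     # what the object is normally used for
    metaphorical: str   # narrative reading proposed by the VLM


def scene_to_payload(objects):
    """AR -> VLM direction: serialize the scene description as JSON."""
    return json.dumps({"objects": [asdict(o) for o in objects]})


def ground_vlm_response(raw_json, known_anchor_ids):
    """VLM -> AR direction: keep only complete entries whose anchor exists."""
    grounded, rejected = [], []
    for entry in json.loads(raw_json).get("objects", []):
        complete = all(key in entry for key in LAYER_FIELDS)
        if complete and entry["anchor_id"] in known_anchor_ids:
            grounded.append(ObjectSemantics(**{k: entry[k] for k in LAYER_FIELDS}))
        else:
            # Entries naming unknown anchors cannot be placed in the scene.
            rejected.append(entry)
    return grounded, rejected
```

Rejecting entries that name an unknown anchor is one simple way to handle the naming mismatch between generated scripts and the AR scene that the abstract describes; a real pipeline might instead re-prompt the VLM or attempt fuzzy matching.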
