An Exploratory Study on Multi-modal Generative AI in AR Storytelling

Hyungjun Doh, Jingyu Shi, Rahul Jain, Heesoo Kim, Karthik Ramani

TL;DR

This study defines a design space for multi-modal GenAI in AR storytelling by analyzing 223 AR videos and building a testbed that supports five modalities (Text, Audio, Image, Video, 3D) and four atomic storytelling elements (Character, Background, Sentiment, Development). Through two studies with 30 experienced storytellers, it investigates modality preferences, interaction with AI, and the quality of AI-generated content. It finds that images suit characters and backgrounds well while video supports development, though video quality often limits alignment with the narrator's intent. Participants generally found co-creative AI interactions easy but noted that guiding outputs via prompts remains hard, underscoring the need for context-aware, selective augmentation and richer AR interactions. The work contributes a concrete design space, a functional testbed leveraging Motion-Diffusion-Model, Text2Video-Zero, Stable Diffusion, MusicGen, and DreamFusion, and actionable design recommendations for future AR storytelling systems employing GenAI.

Abstract

Storytelling in AR has gained attention due to its multi-modality and interactivity. However, generating multi-modal content for AR storytelling requires expertise and effort to convey the narrator's intention with high quality. Recently, Generative AI (GenAI) has shown promising applications in multi-modal content generation. Despite the potential benefit, current research calls for validating the effect of AI-generated content (AIGC) in AR Storytelling. Therefore, we conducted an exploratory study to investigate the utilization of GenAI. Analyzing 223 AR videos, we identified a design space for multi-modal AR Storytelling. Based on the design space, we developed a testbed facilitating multi-modal content generation and atomic elements in AR Storytelling. Through two studies with N=30 experienced storytellers and live presenters, we (1) revealed participants' preferences for modalities, (2) evaluated the interactions with AI to generate content, and (3) assessed the quality of the AIGC for AR Storytelling. We further discussed design considerations for future AR Storytelling with GenAI.

Paper Structure

This paper contains 46 sections, 13 figures.

Figures (13)

  • Figure 1: Our Design Space for Multi-modal AR Storytelling: Modalities and Elements. These examples illustrate how a modality augments each element of AR Storytelling, offering a visual summary of our design space as derived from our analysis of the 223 augmented videos.
  • Figure 2: The testbed workflow. 1) Content Generator interface: The user employs the content generator, which supports five modalities, to create AIGC for the selected sentence in AR Storytelling. 2) AR interface: The user can view the text corresponding to spoken words. Based on the transcribed speech, the user can interact with the AIGC using their hand.
  • Figure 3: The testbed. (a) The Multi-Modal Content Generator interface. (a-i) The textual input section. The user can import a story text file for storytelling by entering the story title. (a-ii) The loaded story, where the user can read the story and select a sentence for augmentation by dragging over it with the mouse cursor. (a-iii) The highlight (top) and save (bottom) buttons. The top button highlights the selected portion in yellow, indicating to the user that this part will be used for generation. The bottom button saves the highlighted text to the backend for output generation. (a-iv) The modality selection. The user can choose the desired modality for content generation. (a-v) The output. The testbed displays the generated content, and the user can save it using a unique keyword. This keyword acts as a trigger during the storytelling process. (a-vi) The quality evaluation of the content by the user. (b) The AR interface. (b-i) Image. The user can interact with the AIGC using hand landmarks. The content appears on the tip of the user's index finger. (b-ii) The Speech-to-Text box. The interface displays the narrator's words, which serve as triggers for the corresponding content. (b-iii) Text. This modality shows the main keyword and detailed information. (b-iv) Audio. The icon indicates that the corresponding audio is playing. (b-v) Video. (b-vi) 3D Content.
  • Figure 4: The distribution of the four elements that emerge in the stories we used for the study.
  • Figure 5: Results for preference of modality.
  • ...and 8 more figures