I Prompt, it Generates, we Negotiate. Exploring Text-Image Intertextuality in Human-AI Co-Creation of Visual Narratives with VLMs

Mengyao Guo, Kexin Nie, Ze Gao, Black Sun, Xueyang Wang, Jinda Han, Xingting Wu

TL;DR

The paper addresses how text–image intertextuality emerges in human–AI co-creation of sequential visual narratives using Vision-Language systems. It employs a three-phase qualitative study with 15 participants using GPT-4o to observe novice engagement with visual storytelling and intertextual meaning-making. Key contributions include identifying four collaboration patterns, three fsQCA pathways (Educational Collaborator, Technical Expert, Visual Thinker), and a set of challenges (cultural representation, visual coherence, translation of narrative to visuals) along with design implications for role-based AI assistants that support iterative, human-led creativity. The work advances understanding of end-user interactions with cross-modal AI in storytelling and offers practical guidance for designing AI tools that scaffold creativity through structured workflows and memory-enabled, culturally aware capabilities.

Abstract

Creating meaningful visual narratives through human-AI collaboration requires understanding how text-image intertextuality emerges when textual intentions meet AI-generated visuals. We conducted a three-phase qualitative study with 15 participants using GPT-4o to investigate how novices navigate sequential visual narratives. Our findings show that users develop strategies to harness AI's semantic surplus by recognizing meaningful visual content beyond literal descriptions, iteratively refining prompts, and constructing narrative significance through complementary text-image relationships. We identified four distinct collaboration patterns and, through fuzzy-set qualitative comparative analysis (fsQCA), discovered three pathways to successful intertextual collaboration: Educational Collaborator, Technical Expert, and Visual Thinker. However, participants faced challenges, including cultural representation gaps, visual consistency issues, and difficulties translating narrative concepts into visual prompts. These findings contribute to HCI research by providing an empirical account of text-image intertextuality in human-AI co-creation and proposing design implications for role-based AI assistants that better support iterative, human-led creative processes in visual storytelling.

Paper Structure

This paper contains 49 sections, 22 figures, 3 tables.

Figures (22)

  • Figure 1: Throughout the communication process, cultural traditions have developed distinct approaches to visual representation. East Asian traditions often employ multiple perspectives and rhythmic elements to suggest narrative flow while maintaining holistic views [jing2024exploration, green2013rethinking], as exemplified in "Along the River During the Qingming Festival" [yu2023city]. In contrast, Western traditions have typically relied on linear perspective and the decisive moment [sweet2008dialogue, panofsky2020perspective, cartier1993decisive], as evident in works such as "The Last Supper" and "Las Meninas". These approaches reflect fundamentally different cognitive frameworks in the Eastern and Western worlds for organizing visual information rather than mere stylistic preferences. Differences in observation have yielded distinct models of visual narrative, underscoring the centrality of visual narrative to cultural representation and motivating our focus on text–image intertextuality.
  • Figure 2: The orange square is one key component of the sequential relationship in Calvin and Hobbes. Credit: Bill Watterson.
  • Figure 3: The intertextual relationship between complementary text and image in Home. Credit: Carson Ellis.
  • Figure 4: The intertextual relationship between complementary text and image in The Liszts. Credit: Kyo Maclear and Júlia Sarda.
  • Figure 5: Our research phases. Credit: the authors.
  • ...and 17 more figures