KAHANI: Culturally-Nuanced Visual Storytelling Tool for Non-Western Cultures

Hamna, Deepthi Sudharsan, Agrima Seth, Ritvik Budhiraja, Deepika Khullar, Vyshak Jain, Kalika Bali, Aditya Vashistha, Sameer Segal

TL;DR

Kahani addresses the gap in culturally grounded AI storytelling with a model-agnostic pipeline that extracts Culturally Specific Items (CSIs), generates the story text, profiles characters, plans scenes, and creates visuals with SDXL, all guided by Chain-of-Thought prompting. In a user study with participants from different regions of India, Kahani produced more culturally nuanced text and visuals than the baseline of ChatGPT-4 with DALL-E3, outperforming or matching it in 27 of 36 comparisons and showing improvements in cultural nuance, CSI usage, image consistency, and accuracy of cultural elements. The work also introduces two evaluation frameworks: a reference-based metric inspired by BLEU and a reference-free metric inspired by MQM. It discusses ethical considerations, limitations, and future directions such as integrating external knowledge sources and enabling iterative feedback. The code and prompts are publicly available to support further research in culturally aware AI generation.

Abstract

Large Language Models (LLMs) and Text-To-Image (T2I) models have demonstrated the ability to generate compelling text and visual stories. However, their outputs are predominantly aligned with the sensibilities of the Global North, often resulting in an outsider's gaze on other cultures. As a result, non-Western communities have to put extra effort into generating culturally specific stories. To address this challenge, we developed a visual storytelling tool called Kahani that generates culturally grounded visual stories for non-Western cultures. Our tool leverages the off-the-shelf models GPT-4 Turbo and Stable Diffusion XL (SDXL). By using Chain of Thought (CoT) and T2I prompting techniques, we capture the cultural context from the user's prompt and generate vivid descriptions of the characters and scene compositions. To evaluate the effectiveness of Kahani, we conducted a comparative user study with ChatGPT-4 (with DALL-E3) in which participants from different regions of India compared the cultural relevance of stories generated by the two tools. The qualitative and quantitative analyses from the user study show that Kahani's visual stories are more culturally nuanced than those generated by ChatGPT-4. In 27 out of 36 comparisons, Kahani outperformed or was on par with ChatGPT-4, effectively capturing cultural nuances and incorporating more Culturally Specific Items (CSI), validating its ability to generate culturally grounded visual stories.

Paper Structure

This paper contains 58 sections, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Comparison of visual stories generated for a story prompt by two tools, Kahani and ChatGPT-4 + DALL-E3. In both the visuals and the text, Kahani demonstrates a significantly better ability to capture the cultural essence of the geography the story is based on.
  • Figure 2: Example of representational harm in state-of-the-art T2I models when asked to create an image of a "South Indian girl in a park". The generated images depict the "South Indian girl" with an unreasonable amount of ornamentation for a park setting, largely reinforcing the Western gaze on Indian culture.
  • Figure 3: Overview of our Kahani Visual Storytelling Pipeline. (1) We first extract Culturally Specific Items (CSIs) from an input user prompt and expand on these CSIs; (2) We then generate the text story using the input user prompt and the expanded cultural details to add more cultural context to the story; (3) From the generated story, we extract and create character profiles consisting of visual descriptions and attributes of the characters involved in the story; (4) Using a four-act story arc as a reference, the story is segmented into distinct scenes, and each scene is planned and outlined to prepare for visual generation; (5) Finally, we provide a T2I prompt template and instructions to the LLM, generate a T2I prompt for each scene, and generate the story visuals. Stable Diffusion XL (SDXL) was used to generate the visuals, and GPT-4 Turbo was used for all text generation.
  • Figure 4: Scene 1 visual generated for the example (Preeti) story
  • Figure 5: Comparative analysis of Reference-based metric scores
  • ...and 5 more figures
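
The five stages described in the Figure 3 caption can be sketched as a chain of model calls. The following is a minimal illustration and not the authors' implementation: the function name `generate_visual_story`, the `VisualStory` container, the bracketed step tags, and all prompt wording are hypothetical, and the assumption that the scene planner returns one scene per line is ours. To reflect the model-agnostic design, the LLM and T2I model are passed in as plain callables, so GPT-4 Turbo and SDXL (or any substitutes) can be wrapped and supplied by the caller.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Model-agnostic interfaces (assumptions): any text model and any image model
# can be wrapped to match these signatures.
LLM = Callable[[str], str]    # text prompt -> text completion
T2I = Callable[[str], bytes]  # T2I prompt -> encoded image bytes


@dataclass
class VisualStory:
    """Hypothetical container for the intermediate and final outputs."""
    cultural_details: str
    story: str
    character_profiles: str
    scenes: List[str]
    t2i_prompts: List[str] = field(default_factory=list)
    images: List[bytes] = field(default_factory=list)


def generate_visual_story(user_prompt: str, llm: LLM, t2i: T2I) -> VisualStory:
    # (1) Extract Culturally Specific Items (CSIs) and expand on them.
    cultural_details = llm(
        f"[csi] Identify the culturally specific items in this prompt "
        f"and expand on each:\n{user_prompt}"
    )
    # (2) Generate the story from the prompt plus the expanded details.
    story = llm(
        f"[story] Write a story for this prompt:\n{user_prompt}\n"
        f"Cultural details to incorporate:\n{cultural_details}"
    )
    # (3) Build character profiles (visual descriptions and attributes).
    character_profiles = llm(
        f"[characters] Describe the appearance and attributes of each "
        f"character in this story:\n{story}"
    )
    # (4) Segment the story into scenes using a four-act arc.
    #     Assumption: the model returns one scene outline per line.
    scene_plan = llm(
        f"[scene_plan] Using a four-act story arc, outline the scenes of "
        f"this story, one per line:\n{story}"
    )
    scenes = [line.strip() for line in scene_plan.splitlines() if line.strip()]

    result = VisualStory(cultural_details, story, character_profiles, scenes)

    # (5) Turn each scene into a T2I prompt, then render it.
    for scene in scenes:
        t2i_prompt = llm(
            f"[t2i_prompt] Write an image-generation prompt for this scene.\n"
            f"Scene: {scene}\nCharacters: {character_profiles}\n"
            f"Cultural details: {cultural_details}"
        )
        result.t2i_prompts.append(t2i_prompt)
        result.images.append(t2i(t2i_prompt))
    return result
```

Passing the character profiles and cultural details into every per-scene prompt is one plausible way to keep characters and cultural elements consistent across images. The prompts actually used by Kahani are in the authors' public release and are not reproduced here.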