TALES: A Taxonomy and Analysis of Cultural Representations in LLM-generated Stories

Kirti Bhagat, Shaily Bhatt, Athul Velagapudi, Aditya Vashistha, Shachi Dave, Danish Pruthi

TL;DR

TALES introduces a community-informed framework to evaluate cultural representations in LLM-generated stories about India. It develops TALES-Tax to categorize misrepresentations, conducts a large-scale, multi-language annotation study with 108 annotators with lived cultural experience to quantify prevalence, and creates TALES-QA to probe cultural knowledge independently of generation. The findings reveal pervasive misrepresentations, especially in mid- and low-resourced languages and in stories set in peri-urban regions, yet models retain substantial cultural knowledge, highlighting a gap between knowledge and its application in storytelling. The work advocates community-centered evaluation, examines intra-regional diversity, and offers design directions to mitigate failure modes in generative cultural representations.

Abstract

Millions of users across the globe turn to AI chatbots for their creative needs, inviting widespread interest in understanding how such chatbots represent diverse cultures. At the same time, evaluating cultural representations in open-ended tasks remains challenging and underexplored. In this work, we present TALES, an evaluation of cultural misrepresentations in LLM-generated stories for diverse Indian cultural identities. First, we develop TALES-Tax, a taxonomy of cultural misrepresentations by collating insights from participants with lived experiences in India through focus groups (N=9) and individual surveys (N=15). Using TALES-Tax, we evaluate 6 models through a large-scale annotation study spanning 2,925 annotations from 108 annotators with lived cultural experience from across 71 regions in India and 14 languages. Concerningly, we find that 88% of the generated stories contain one or more cultural inaccuracies, and such errors are more prevalent in mid- and low-resourced languages and stories based in peri-urban regions in India. Lastly, we transform the annotations into TALES-QA, a standalone question bank to evaluate the cultural knowledge of foundational models. Through this evaluation, we surprisingly discover that models often possess the requisite cultural knowledge despite generating stories rife with cultural misrepresentations.

Paper Structure

This paper contains 60 sections, 5 figures, 16 tables.

Figures (5)

  • Figure 1: A broad overview of TALES: (RQ1) we identified categories of cultural misrepresentation through focus groups and surveys to develop TALES-Tax, (RQ2) conducted a large-scale annotation study to quantify the frequency of misrepresentation, and (RQ3) constructed TALES-QA from the annotated data to evaluate the cultural knowledge of models.
  • Figure 2: Annotation interface in which participants could read stories, mark spans, assign each span to a category of misrepresentation, and leave a comment explaining their reasoning.
  • Figure 3: Average number of misrepresentations per story across high-, mid-, and low-resourced languages. Models generate statistically significantly more misrepresentations for mid- and low-resourced languages, with linguistic inaccuracies increasing the most. The dotted line indicates the overall average across all models.
  • Figure 4: Average misrepresentations per story across tiers. Models make statistically significantly more misrepresentations for tier-2 and tier-3 regions, with cultural and factual inaccuracies having the highest increase.
  • Figure 5: Frequency of misrepresentation of different CSI categories. Food, social practices, and social norms were highly misrepresented across all misrepresentation categories.
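
The two summary statistics reported above (the share of stories with one or more misrepresentations, and the average number of misrepresentations per story within a group such as language resource level) can be illustrated with a minimal sketch. This is not the authors' analysis code. The record layout, the `summarize` helper, and the toy values are all hypothetical, and the sketch assumes one row per story with its count of marked spans already aggregated.

```python
from collections import defaultdict

# Hypothetical toy records: (story_id, resource_level, number_of_marked_spans).
# These values are invented for illustration and are NOT the paper's data.
annotations = [
    ("s1", "high", 0),
    ("s2", "high", 2),
    ("s3", "mid", 3),
    ("s4", "mid", 1),
    ("s5", "low", 4),
    ("s6", "low", 2),
]


def summarize(records):
    """Return (share of stories with >= 1 misrepresentation, mean count per group)."""
    totals = defaultdict(int)  # total marked spans per group
    counts = defaultdict(int)  # number of stories per group
    flagged = 0                # stories with at least one marked span
    for _story_id, level, n_spans in records:
        totals[level] += n_spans
        counts[level] += 1
        if n_spans > 0:
            flagged += 1
    share = flagged / len(records)
    means = {level: totals[level] / counts[level] for level in totals}
    return share, means


share, means = summarize(annotations)
print(f"Share of stories with one or more misrepresentations: {share:.2f}")
for level, mean in means.items():
    print(f"{level}: {mean:.2f} misrepresentations per story on average")
```

The same grouping could be keyed on region tier instead of resource level. Testing whether group differences are statistically significant, as the captions report, would require a separate test that this sketch does not attempt.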