Cross-cultural Inspiration Detection and Analysis in Real and LLM-generated Social Media Data

Oana Ignat, Gayathri Ganesh Lakshmy, Rada Mihalcea

TL;DR

The paper studies cross-cultural inspiration detection and generation through InspAIred, a dataset of 2,000 real inspiring posts, 2,000 real non-inspiring posts, and 2,000 GPT-4-generated inspiring posts, evenly distributed across India and the UK. The authors run linguistic analyses (stylistic, semantic, LIWC) and topic modeling to compare inspiring content across cultures and to contrast AI-generated with human-written posts. They also evaluate detection with three setups: Random Forest over TF-IDF features, XLM-RoBERTa, and Llama fine-tuned with LoRA. The detection models distinguish inspiring content across cultures and data sources with high accuracy, including in few-shot regimes, and the analyses show how inspiration is expressed differently across cultures and data sources. The public InspAIred dataset and baselines offer a resource for cross-cultural NLP research on motivation, creativity, and content generation, with practical implications for education, health, and media applications.

Abstract

Inspiration is linked to various positive outcomes, such as increased creativity, productivity, and happiness. Although inspiration has great potential, there has been limited effort toward identifying content that is inspiring, as opposed to just engaging or positive. Additionally, most research has concentrated on Western data, with little attention paid to other cultures. This work is the first to study cross-cultural inspiration through machine learning methods. We aim to identify and analyze real and AI-generated cross-cultural inspiring posts. To this end, we compile and make publicly available the InspAIred dataset, which consists of 2,000 real inspiring posts, 2,000 real non-inspiring posts, and 2,000 generated inspiring posts evenly distributed across India and the UK. The real posts are sourced from Reddit, while the generated posts are created using the GPT-4 model. Using this dataset, we conduct extensive computational linguistic analyses to (1) compare inspiring content across cultures, (2) compare AI-generated inspiring posts to real inspiring posts, and (3) determine if detection models can accurately distinguish between inspiring content across cultures and data sources.
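The dataset composition described in the abstract can be made concrete with a small sketch. It assumes that "evenly distributed across India and the UK" means each of the three categories is split in half between the two countries; the category labels and the function name are hypothetical, chosen only for illustration, and are not taken from the released dataset.

```python
from itertools import product

# Counts stated in the abstract: 2,000 posts in each of three categories.
POSTS_PER_CATEGORY = 2_000
# Hypothetical label names, used only for this illustration.
CATEGORIES = ("real_inspiring", "real_non_inspiring", "generated_inspiring")
COUNTRIES = ("India", "UK")


def dataset_composition():
    """Return {(category, country): post_count}, assuming an even country split."""
    per_cell, remainder = divmod(POSTS_PER_CATEGORY, len(COUNTRIES))
    if remainder:
        raise ValueError("category size does not split evenly across countries")
    return {cell: per_cell for cell in product(CATEGORIES, COUNTRIES)}


composition = dataset_composition()
total_posts = sum(composition.values())

if __name__ == "__main__":
    for (category, country), count in composition.items():
        print(f"{category:<20} {country:<6} {count}")
    print("total", total_posts)
```

Under this assumption, each category contributes 1,000 posts per country, for 6,000 posts in total.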

Paper Structure (38 sections, 11 figures, 4 tables)

Figures (11)

  • Figure 1: We compare AI-generated and human-written inspiring Reddit content across India and the UK. Although it is challenging for a person to distinguish between them, we find significant linguistic cross-cultural differences between generated and real inspiring posts.
  • Figure 2: Annotation guidelines for labeling inspiration.
  • Figure 3: Visualization of topics used in the real and generated inspiring posts from the UK. Points are colored red or blue based on the association of their corresponding terms with UK real inspiring posts or UK LLM-generated inspiring posts. The most associated topics are listed under the "Top Generated" and "Top Real" headings. Interactive version: https://github.com/MichiganNLP/cross_inspiration.
  • Figure 4: Classification test accuracy in the few-shot and default setups with the Random Forest TF-IDF (RF), XLM-RoBERTa base (RB), and Llama 2 7B (LL) models.
  • Figure 5: Scattertext visualization of unigrams used in the real inspiring and non-inspiring Reddit posts from India. Points are colored red or blue based on the association of their corresponding terms with Indian non-inspiring posts or Indian inspiring posts. The most associated terms are listed under the "Top inspiring" and "Top Non-inspiring" headings.
  • ...and 6 more figures