BEYONDWORDS is All You Need: Agentic Generative AI based Social Media Themes Extractor
Mohammed-Khalil Ghali, Abdelrahman Farrag, Sarah Lam, Daehan Won
TL;DR
BEYONDWORDS presents an integrated framework for scalable thematic analysis of social media using tweet embeddings, autoencoder-based dimensionality reduction, matrix factorization, and agentic Generative AI. The method iteratively refines themes via Chain-of-Thought prompting, with a secondary LLM for quality control. Demonstrated on tweets from the autistic community, it reveals three core themes: content quality and engagement, advocacy and acceptance, and mental health. The approach clusters the tweets into three latent themes and preserves semantic nuance through embedding-based representations, supporting actionable insights for advocacy and decision-making. Its combination of scalable ML techniques and generative AI offers a practical, adaptable tool for analyzing online discourse in diverse communities.
Abstract
Thematic analysis of social media posts offers valuable insight into public discourse, yet traditional methods often struggle to capture the complexity and nuance of unstructured, large-scale text data. This study introduces a novel methodology for thematic analysis that integrates tweet embeddings from pre-trained language models, dimensionality reduction using autoencoders and matrix factorization, and generative AI to identify and refine latent themes. Our approach clusters compressed tweet representations and employs generative AI to extract and articulate themes through agentic Chain-of-Thought (CoT) prompting, with a secondary LLM for quality assurance. We apply this methodology to tweets from the autistic community, a group that increasingly uses social media to discuss its experiences and challenges. By automating thematic extraction, we aim to uncover key insights while preserving the richness of the original discourse. The autism case study demonstrates the utility of the proposed approach for thematic analysis of social media data, offering a scalable and adaptable framework that can be applied to diverse contexts. The results highlight the potential of combining machine learning and Generative AI to enhance the depth and accuracy of theme identification in online communities.
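The propose-and-review loop described above (a generative model articulating a theme, a secondary LLM checking it) can be illustrated with a minimal control-flow sketch. This is not the authors' implementation: the names `refine_theme`, `Review`, `stub_propose`, and `stub_review` are hypothetical, the two "models" are deterministic stand-ins for real LLM calls, and the example tweets are invented placeholders rather than data from the study.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Review:
    """Verdict returned by the reviewing model (hypothetical structure)."""
    approved: bool
    feedback: str


def refine_theme(
    tweets: List[str],
    propose: Callable[[List[str], str], str],
    review: Callable[[List[str], str], Review],
    max_rounds: int = 3,
) -> str:
    """Alternate between a proposing model and a reviewing model.

    Stops as soon as the reviewer approves, or after max_rounds proposals.
    """
    feedback = ""
    theme = ""
    for _ in range(max_rounds):
        theme = propose(tweets, feedback)   # primary model drafts a theme
        verdict = review(tweets, theme)     # secondary model checks it
        if verdict.approved:
            break
        feedback = verdict.feedback         # feed critique into next draft
    return theme


def stub_propose(tweets: List[str], feedback: str) -> str:
    # Stand-in for an LLM call with a Chain-of-Thought prompt (assumption).
    # First draft is deliberately vague; the revised draft is specific.
    return "general discussion" if not feedback else "mental health"


def stub_review(tweets: List[str], theme: str) -> Review:
    # Stand-in for the secondary LLM (assumption): approve only if the
    # theme phrase literally appears in at least one tweet of the cluster.
    supported = any(theme in t.lower() for t in tweets)
    note = "" if supported else "Theme is too generic; name the shared topic."
    return Review(approved=supported, feedback=note)


# Invented placeholder cluster, for illustration only.
cluster = [
    "Burnout is wrecking my mental health this week",
    "Mental health support for autistic adults is so hard to find",
]
final_theme = refine_theme(cluster, stub_propose, stub_review)
print(final_theme)
```

In a real pipeline the two callables would wrap LLM API calls, and `tweets` would be the members of one cluster obtained from the compressed embeddings. Capping the loop with `max_rounds` keeps the cost bounded when the reviewer never approves.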
