CliME: Evaluating Multimodal Climate Discourse on Social Media and the Climate Alignment Quotient (CAQ)
Abhilekh Borah, Hasnat Md Abdullah, Kangda Wei, Ruihong Huang
TL;DR
This work addresses how to evaluate climate-related discourse produced by LLMs in response to multimodal social media content. It introduces CliME, a dataset of 2,579 image-text posts from Reddit and Twitter, with descriptors generated by a vision-language model and refined through human annotation. It pairs the dataset with the Climate Alignment Quotient (CAQ), a five-dimension metric (Resonance, Articulation, Evidence, Transition, Specificity) applied through three analytical lenses (Actionability, Criticality, Justice). The CAQ is computed as a weighted sum, $CAQ = w_1 \cdot Resonance + w_2 \cdot Articulation + w_3 \cdot Evidence + w_4 \cdot Transition + w_5 \cdot Specificity$, where $w_1=0.25$, $w_2=0.30$, $w_3=0.20$, $w_4=0.15$, and $w_5=0.10$, so the weights sum to 1. Benchmarking five state-of-the-art LLMs across the three lenses, the study finds high Resonance, variable Actionability and Criticality, weak Transition, and consistently similar Justice scores. These results offer guidance for improving grounded, actionable climate communication and for informing policy-relevant public discourse. The dataset and code are publicly released, enabling broader research into reliable multimodal climate communication and the responsible use of LLMs in this domain.
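The weighted sum above can be illustrated with a minimal Python sketch. The weights are the ones stated in the TL;DR; everything else is an assumption made for illustration: the function name `caq`, the dictionary keys, the assumption that each dimension score lies in [0, 1], and the example scores, which are hypothetical and not results from the paper.

```python
# Weights w_1..w_5 as stated in the TL;DR (they sum to 1.0).
CAQ_WEIGHTS = {
    "resonance": 0.25,
    "articulation": 0.30,
    "evidence": 0.20,
    "transition": 0.15,
    "specificity": 0.10,
}


def caq(scores, weights=CAQ_WEIGHTS):
    """Return the weighted sum of the five dimension scores.

    `scores` maps each dimension name to a number. Treating scores as
    values in [0, 1] is an assumption of this sketch, not something the
    summary specifies.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    return sum(weights[dim] * scores[dim] for dim in weights)


# Hypothetical per-dimension scores for one model response.
example_scores = {
    "resonance": 0.9,
    "articulation": 0.8,
    "evidence": 0.6,
    "transition": 0.4,
    "specificity": 0.5,
}

print(f"CAQ = {caq(example_scores):.3f}")
```

Because Articulation and Resonance together carry 55% of the weight, a response can score well overall while remaining weak on Transition or Specificity, which is worth keeping in mind when reading aggregate CAQ values.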
Abstract
The rise of Large Language Models (LLMs) has raised questions about their ability to understand climate-related contexts. Though climate change dominates social media, its multimodal expressions remain understudied, and current tools cannot determine whether LLMs amplify credible solutions or spread unsubstantiated claims. To address this, we introduce CliME (Climate Change Multimodal Evaluation), a first-of-its-kind multimodal dataset comprising 2,579 Twitter and Reddit posts. The benchmark features a diverse collection of humorous memes and skeptical posts, capturing how these formats distill complex issues into viral narratives that shape public opinion and policy discussions. To systematically evaluate LLM performance, we present the Climate Alignment Quotient (CAQ), a novel metric comprising five distinct dimensions: Articulation, Evidence, Resonance, Transition, and Specificity. Additionally, we propose three analytical lenses (Actionability, Criticality, and Justice) to guide the assessment of LLM-generated climate discourse using CAQ. Our findings, based on the CAQ metric, indicate that while most evaluated LLMs perform relatively well in Criticality and Justice, they consistently underperform on the Actionability axis. Among the models evaluated, Claude 3.7 Sonnet achieves the highest overall performance. We publicly release our CliME dataset and code to foster further research in this domain.
