CliME: Evaluating Multimodal Climate Discourse on Social Media and the Climate Alignment Quotient (CAQ)

Abhilekh Borah, Hasnat Md Abdullah, Kangda Wei, Ruihong Huang

TL;DR

This work addresses how to evaluate climate-related discourse produced by LLMs in response to multimodal social media content. It introduces CliME, a dataset of 2,579 image-text posts from Reddit and Twitter, with descriptions generated by a vision-language model and refined by human annotators, and pairs it with the Climate Alignment Quotient (CAQ), a five-dimension metric (Resonance, Articulation, Evidence, Transition, Specificity) applied through three analytical lenses (Actionability, Criticality, Justice). The CAQ is computed as a weighted sum, $CAQ = w_1 \cdot \text{Resonance} + w_2 \cdot \text{Articulation} + w_3 \cdot \text{Evidence} + w_4 \cdot \text{Transition} + w_5 \cdot \text{Specificity}$, where $w_1=0.25$, $w_2=0.30$, $w_3=0.20$, $w_4=0.15$, and $w_5=0.10$. Benchmarking five state-of-the-art LLMs across the three lenses, the study finds high Resonance and variable Actionability and Criticality, with Transition a weakly represented dimension and Justice scores remaining consistently similar. These results offer guidance for improving grounded, actionable climate communication and for informing policy-relevant public discourse. The dataset and code are publicly released, enabling broader research into reliable multimodal climate communication and the responsible use of LLMs in this domain.
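As a rough illustration of the weighted sum stated in the TL;DR, the following minimal sketch computes a CAQ value from five per-dimension scores. The weights are the ones given in the text. The function name `caq`, the dictionary keys, the example scores, and the assumption that each dimension score lies in [0, 1] are hypothetical and are not taken from the authors' released code.

```python
from typing import Mapping

# Weights as stated in the TL;DR; they sum to 1.0.
CAQ_WEIGHTS = {
    "resonance": 0.25,
    "articulation": 0.30,
    "evidence": 0.20,
    "transition": 0.15,
    "specificity": 0.10,
}


def caq(scores: Mapping[str, float]) -> float:
    """Return the weighted sum of the five dimension scores.

    Assumption (not confirmed by the source): each score is in [0, 1].
    """
    missing = set(CAQ_WEIGHTS) - set(scores)
    if missing:
        raise KeyError(f"missing dimension scores: {sorted(missing)}")
    return sum(CAQ_WEIGHTS[dim] * scores[dim] for dim in CAQ_WEIGHTS)


# Hypothetical scores for one generated response, for illustration only.
example = {
    "resonance": 0.9,
    "articulation": 0.8,
    "evidence": 0.7,
    "transition": 0.4,
    "specificity": 0.6,
}
print(round(caq(example), 4))  # 0.725
```

Because Articulation and Resonance together carry 55% of the weight, a response with a weak Transition score can still reach a moderately high CAQ under this weighting, as the example shows.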

Abstract

The rise of Large Language Models (LLMs) has raised questions about their ability to understand climate-related contexts. Though climate change dominates social media, its multimodal expressions remain understudied, and current tools have failed to determine whether LLMs amplify credible solutions or spread unsubstantiated claims. To address this, we introduce CliME (Climate Change Multimodal Evaluation), a first-of-its-kind multimodal dataset comprising 2,579 Twitter and Reddit posts. The benchmark features a diverse collection of humorous memes and skeptical posts, capturing how these formats distill complex issues into viral narratives that shape public opinion and policy discussions. To systematically evaluate LLM performance, we present the Climate Alignment Quotient (CAQ), a novel metric comprising five distinct dimensions: Articulation, Evidence, Resonance, Transition, and Specificity. Additionally, we propose three analytical lenses: Actionability, Criticality, and Justice, to guide the assessment of LLM-generated climate discourse using CAQ. Our findings, based on the CAQ metric, indicate that while most evaluated LLMs perform relatively well in Criticality and Justice, they consistently underperform on the Actionability axis. Among the models evaluated, Claude 3.7 Sonnet achieves the highest overall performance. We publicly release our CliME dataset and code to foster further research in this domain.

Paper Structure

This paper contains 18 sections, 3 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: CliME sample data: Each data point includes a climate change-related image from a Reddit or Twitter post, the accompanying post text, and a generated description integrating both the image and text.
  • Figure 2: Overview of the Climate Change Multimodal Evaluation (CliME) dataset and the Climate Alignment Quotient (CAQ) workflow. The upper section illustrates the data collection process from Twitter and Reddit posts, utilizing multimodal sources (text and images) and description generation through the Janus-Pro-7B model followed by human annotations. The lower section demonstrates the CAQ evaluation framework, integrating multimodal data and analytical lenses (Actionability, Criticality, Justice) to assess climate communication across five dimensions: Articulation, Evidence, Resonance, Transition, and Specificity.
  • Figure 3: Comparative analysis of CAQ scores across the Actionability, Criticality, and Justice lenses for five large language models (Claude 3.7 Sonnet, GPT-4o, LLaMA 3.3 70B, Qwen QwQ 32B, and Gemini 2.0 Flash) on CliME. Each bar represents the mean CAQ score, and error bars indicate the standard deviation, showing the variability in model performance. Claude 3.7 Sonnet generally outperforms the other models across all lenses, with scores consistently above 0.70 and relatively stable standard deviations.
  • Figure 4: 3D scatter plot of CAQ scores for the Claude 3.7 Sonnet model on the CliME dataset. Each data point represents a description's CAQ values along three axes: Actionability (x-axis), Criticality (y-axis), and Justice (z-axis). The color scale (legend on the right) indicates the CAQ score of Actionability. Points near the center denote balanced discourse across all dimensions, whereas deviations along any axis suggest an overemphasis or under-representation of that particular lens.
  • Figure 5: Gap analysis of Claude 3.7 Sonnet's CAQ scores. The left panel shows box plots of scores across three dimensions: Actionability (mean: 0.7321), Criticality (mean: 0.7416), and Justice (mean: 0.7321). The right panel displays a heatmap of gap statistics between dimension pairs: the Actionability-Justice gap (0.0344) is the largest, followed by the Actionability-Criticality gap (0.0324), while the Criticality-Justice gap (0.0313) is the smallest, indicating the best balance. The analysis reveals more balanced dimensional scores than for the other models, with fewer large gaps across all dimension pairs.
  • ...and 4 more figures