The Accuracy, Robustness, and Readability of LLM-Generated Sustainability-Related Word Definitions

Alice Heiman

TL;DR

The paper investigates how well LLM-generated definitions of sustainability terms align with the IPCC glossary, assessing adherence, robustness, and readability for GPT-4o-mini, Llama-3.1 8B, and Mistral 7B using SBERT semantic similarity. It analyzes 300 IPCC terms, generating 25 completions per term from five prompts and applying bootstrapped readability metrics, and finds average adherence of about 0.57–0.59 and robustness of about 0.96–1.00. The model outputs are longer and more complex than the IPCC definitions, and therefore harder to read, which can impede accessibility. The study concludes that LLMs can support climate discourse, but their outputs must be aligned with established terminology, especially for terms with multiple meanings, and it suggests supplying explicit in-context definitions to improve reliability.

Abstract

A common language with standardized definitions is crucial for effective climate discussions. However, concerns exist about LLMs misrepresenting climate terms. We compared 300 official IPCC glossary definitions with those generated by GPT-4o-mini, Llama3.1 8B, and Mistral 7B, analyzing adherence, robustness, and readability using SBERT sentence embeddings. The LLMs scored an average adherence of $0.57$–$0.59 \pm 0.15$, and their definitions proved harder to read than the originals. Model-generated definitions vary mainly among words with multiple or ambiguous definitions, showing the potential to highlight terms that need standardization. The results show how LLMs could support environmental discourse while emphasizing the need to align model outputs with established terminology for clarity and consistency.
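
As a rough illustration of embedding-based scoring, the sketch below computes an adherence-style and a robustness-style score from cosine similarities. It is not the paper's implementation: the exact formulas are not given in this summary, so the sketch assumes that adherence is the mean similarity between each generated definition and the reference definition, and that robustness is the mean pairwise similarity among the generated definitions. The function names and the three-dimensional toy vectors are hypothetical stand-ins for real SBERT sentence embeddings.

```python
import math
from itertools import combinations
from statistics import mean


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def adherence(reference, generated):
    """Assumed definition: mean similarity of each generated
    definition's embedding to the reference embedding."""
    return mean(cosine(reference, g) for g in generated)


def robustness(generated):
    """Assumed definition: mean pairwise similarity among the
    embeddings of the generated definitions for one term."""
    return mean(cosine(a, b) for a, b in combinations(generated, 2))


# Hypothetical toy vectors standing in for sentence embeddings.
reference_vec = [1.0, 0.0, 0.0]
generated_vecs = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]

adherence_score = adherence(reference_vec, generated_vecs)
robustness_score = robustness(generated_vecs)
print(f"adherence={adherence_score:.3f} robustness={robustness_score:.3f}")
```

In practice the vectors would come from a sentence-embedding model, with one reference embedding per glossary term and one embedding per model completion. Averaging the per-term scores over all terms would then give corpus-level figures comparable in form to those reported above.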

Paper Structure

This paper contains 10 sections, 2 equations, 1 figure, 5 tables.

Figures (1)

  • Figure 1: Distribution of SBERT adherence scores between LLM and official IPCC word definitions.