ClimateBert: A Pretrained Language Model for Climate-Related Text

Nicolas Webersinke, Mathias Kraus, Julia Anna Bingler, Markus Leippold

TL;DR

The paper addresses the gap between general-domain language models and climate-specific text by introducing ClimateBert, a climate-domain pretrained model built via continued pretraining on a large climate corpus. Using DistilRoBERTa as a base, it applies vocabulary augmentation and multiple sample-selection strategies, and evaluates on text classification, sentiment analysis, and climate-related fact-checking, showing a roughly 46–48% reduction in masked LM loss and consistent downstream gains. Notably, ClimateBert_D+S achieves state-of-the-art performance on the CLIMATE-FEVER fact-checking dataset, while Sim-Select and related strategies yield substantial improvements on classification and sentiment tasks. The work also discusses the carbon footprint of training, offsets emissions, and commits to open-sourcing models and code to advance climate NLP research.

Abstract

In recent years, large pretrained language models (LMs) have revolutionized the field of natural language processing (NLP). However, while pretraining on general language has been shown to work very well for common language, it has been observed that niche language poses problems. In particular, climate-related texts include specific language that common LMs cannot represent accurately. We argue that this shortcoming of today's LMs limits the applicability of modern NLP to the broad field of text processing of climate-related texts. As a remedy, we propose ClimateBert, a transformer-based language model that is further pretrained on over 2 million paragraphs of climate-related texts, crawled from various sources such as common news, research articles, and climate reporting of companies. We find that ClimateBert leads to a 48% improvement on a masked language model objective which, in turn, leads to lowering error rates by 3.57% to 35.71% for various climate-related downstream tasks like text classification, sentiment analysis, and fact-checking.
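The percentages in the abstract are most naturally read as relative reductions (of masked LM loss and of downstream error rates) rather than absolute percentage-point changes; that reading is an assumption here, since the abstract does not spell it out. A minimal Python sketch of the difference follows. The helper name `relative_reduction` and all numeric values are hypothetical and chosen only for illustration; none of them come from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the reduction from `baseline` to `improved` as a percentage of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - improved) / baseline


# Hypothetical values for illustration only; NOT results from the paper.
baseline_error = 0.20  # made-up error rate of a general-domain model
improved_error = 0.15  # made-up error rate after continued pretraining

rel = relative_reduction(baseline_error, improved_error)
abs_points = 100.0 * (baseline_error - improved_error)

print(f"relative reduction: {rel:.2f}%")
print(f"absolute reduction: {abs_points:.2f} percentage points")
```

With these made-up numbers, the same change is a 25% relative reduction but only 5 percentage points in absolute terms, which is why the choice of baseline matters when comparing reported improvements.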

Paper Structure

This paper contains 23 sections, 1 equation, 1 figure, 11 tables.

Figures (1)

  • Figure 1: Sequence of training phases. Our main contribution is the continued pretraining of language models on the climate domain. In addition, we evaluate the obtained climate domain-specific language models on various downstream tasks.