CrisisTransformers: Pre-trained language models and sentence encoders for crisis-related social media texts

Rabindra Lamsal; Maria Rodriguez Read; Shanika Karunasekera

TL;DR

CrisisTransformers address the challenge of processing crisis-related social media texts with a domain-specific ensemble of pre-trained language models and sentence encoders, trained on a crisis tweet corpus of over 15 billion word tokens from more than 30 events. The ensemble includes three pre-trained variants (CT-M1 trained from scratch, CT-M2 initialized from RoBERTa, CT-M3 initialized from BERTweet) and contrastive-learning-based sentence encoders, the best of which improves on the prior state of the art by 17.43% in sentence encoding tasks. Evaluations across 18 crisis-specific public datasets show consistent classification gains over strong baselines, while the sentence encoders excel at the semantic similarity tasks that underpin semantic search and clustering. The models are publicly available, supporting domain-adapted crisis text processing for emergency response and crisis management, and the authors plan multilingual expansion and further scaling of the sentence encoders.

Abstract

Social media platforms play an essential role in crisis communication, but analyzing crisis-related social media texts is challenging due to their informal nature. Transformer-based pre-trained models like BERT and RoBERTa have shown success in various NLP tasks, but they are not tailored for crisis-related texts. Furthermore, general-purpose sentence encoders are used to generate sentence embeddings, regardless of the textual complexities in crisis-related texts. Advances in applications like text classification, semantic search, and clustering contribute to the effective processing of crisis-related texts, which is essential for emergency responders to gain a comprehensive view of a crisis event, whether historical or real-time. To address these gaps in crisis informatics literature, this study introduces CrisisTransformers, an ensemble of pre-trained language models and sentence encoders trained on an extensive corpus of over 15 billion word tokens from tweets associated with more than 30 crisis events, including disease outbreaks, natural disasters, conflicts, and other critical incidents. We evaluate existing models and CrisisTransformers on 18 crisis-specific public datasets. Our pre-trained models outperform strong baselines across all datasets in classification tasks, and our best-performing sentence encoder improves the state-of-the-art by 17.43% in sentence encoding tasks. Additionally, we investigate the impact of model initialization on convergence and evaluate the significance of domain-specific models in generating semantically meaningful sentence embeddings. The models are publicly available at: https://huggingface.co/crisistransformers
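To make the semantic search use case mentioned above concrete, the following is a minimal sketch of ranking texts by the cosine similarity of their sentence embeddings. It is not the authors' implementation: the three-dimensional vectors, the example texts, and the function names `cosine` and `semantic_search` are hypothetical stand-ins, and in practice the vectors would come from a sentence encoder such as those the paper releases.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_search(query_vec, corpus_vecs, top_k=2):
    """Return (corpus index, score) pairs for the top_k most similar vectors."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:top_k]]


# Hypothetical toy data: made-up texts and hand-written 3-d "embeddings".
# A real sentence encoder would produce much higher-dimensional vectors.
corpus_texts = [
    "Flood waters rising near the river bridge",
    "Concert tickets on sale this weekend",
    "Heavy flooding reported downtown, roads closed",
]
corpus_vecs = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.8, 0.2, 0.1],
]
query_vec = [1.0, 0.0, 0.0]  # stands in for an encoded flood-related query

hits = semantic_search(query_vec, corpus_vecs)
for idx, score in hits:
    print(f"{score:.3f}  {corpus_texts[idx]}")
```

With these toy vectors the two flood-related texts rank above the unrelated one. The same ranking step applies unchanged to real embeddings, which is why the quality of the encoder largely determines the quality of the search.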

Paper Structure (23 sections, 5 equations, 5 figures, 10 tables)

Introduction
Related Work
Materials and methods
The crisis corpus
Text pre-processing
Unsupervised pre-training
Architecture and pre-training procedure
Pre-training data
Optimization
Fine-tuning
Labelled crisis-related datasets
Enriching sentence encoding
Evaluation setup
Classification task
Sentence encoding task
...and 8 more sections

Figures (5)

Figure 1: A high-level methodological view for developing pre-trained models and sentence encoders.
Figure 2: The pre-training corpus curation process.
Figure 3: Pre-training of CrisisTransformers. Note: "*" represents different checkpoints, which are discussed later in the results section.
Figure 4: Training of our sentence encoders.
Figure 5: Validation loss versus epoch for the CT-M1-*, CT-M2-*, and CT-M3-* checkpoints of CrisisTransformers, showing the impact of different initializations. CT-M1 started with a loss of 9.841 at Epoch 0 and reached its lowest loss at the 26th epoch. CT-M2 started at 2.26 and reached its lowest loss at the 8th epoch. CT-M3 started at 2.856 and reached its lowest loss at the 15th epoch. The y-axis is truncated at 3 for clarity: although the data extends to 9.841, restricting the range keeps the differences between the curves visible, which the full scale would otherwise obscure.
