LLM-Powered Automatic Translation and Urgency in Crisis Scenarios

Belu Ticona, Antonis Anastasopoulos

TL;DR

This paper assesses the viability of large language models and machine translation for crisis-domain translation, with emphasis on preserving urgency in multilingual settings. It introduces a multilingual crisis dataset and an urgency-annotated corpus to test both translation quality and urgency classification. Across languages, state-of-the-art MT and LLM-based translation show substantial degradation and instability, and urgency is often distorted by translation and by language-dependent prompting. The findings highlight critical risks in deploying general-purpose NLP tools for crisis response and call for crisis-aware benchmarks, evaluation criteria, and cross-disciplinary collaboration to build trustworthy multilingual crisis preparedness and response (CPR) systems.

Abstract

Large language models (LLMs) are increasingly proposed for crisis preparedness and response, particularly for multilingual communication. However, their suitability for high-stakes crisis contexts remains insufficiently evaluated. This work examines the performance of state-of-the-art LLMs and machine translation systems in crisis-domain translation, with a focus on preserving urgency, which is a critical property for effective crisis communication and triaging. Using multilingual crisis data and a newly introduced urgency-annotated dataset covering over 32 languages, we show that both dedicated translation models and LLMs exhibit substantial performance degradation and instability. Crucially, even linguistically adequate translations can distort perceived urgency, and LLM-based urgency classifications vary widely depending on the language of the prompt and input. These findings highlight significant risks in deploying general-purpose language technologies for crisis communication and underscore the need for crisis-aware evaluation frameworks.

Paper Structure (18 sections, 2 figures, 2 tables)

Figures (2)

  • Figure 1: Distribution of urgency scores per culture across all annotated sentences, whether in English or in the annotator's language (right). Examples of how human urgency assessments of scenarios change due to translation quality (left). While annotators are overall consistent, automatic translation introduces changes in perceived urgency.
  • Figure 2: LLMs change their urgency assessment across our scenarios, largely due to translation quality.