A Lightweight LLM Framework for Disaster Humanitarian Information Classification

Han Jinzhen; Kim Jisung; Yang Jong Soo; Yun Hong Sik

TL;DR

This work presents a lightweight LLM-based framework for dual-task disaster information classification on HumAID, combining Llama 3.1 8B with prompting, LoRA, and RAG to balance accuracy against resource constraints. LoRA fine-tuning yields strong improvements, reaching 79.62% humanitarian accuracy and 98.79% event-type accuracy, while QLoRA retains about 99% of LoRA performance at half the memory. Counterintuitively, RAG degrades performance for fine-tuned models because of label noise in retrieved examples, and GPT-4 analyses reveal intrinsic ambiguity in several humanitarian categories that limits achievable accuracy. The results favor parameter-efficient fine-tuning over retrieval augmentation for well-defined, data-rich crisis tasks, and point to taxonomy refinement as a practical route to improved reliability. The proposed, reproducible pipeline enables scalable crisis intelligence with modest computational resources, supporting timely and actionable disaster response.

Abstract

Timely classification of humanitarian information from social media is critical for effective disaster response. However, deploying large language models (LLMs) for this task faces challenges in resource-constrained emergency settings. This paper develops a lightweight, cost-effective framework for disaster tweet classification using parameter-efficient fine-tuning. We construct a unified experimental corpus by integrating and normalizing the HumAID dataset (76,484 tweets across 19 disaster events) into a dual-task benchmark: humanitarian information categorization and event type identification. Through systematic evaluation of prompting strategies, LoRA fine-tuning, and retrieval-augmented generation (RAG) on Llama 3.1 8B, we demonstrate that: (1) LoRA achieves 79.62% humanitarian classification accuracy (+37.79% over zero-shot) while training only ~2% of parameters; (2) QLoRA enables efficient deployment with 99.4% of LoRA performance at 50% memory cost; (3) contrary to common assumptions, RAG strategies degrade fine-tuned model performance due to label noise from retrieved examples. These findings establish a practical, reproducible pipeline for building reliable crisis intelligence systems with limited computational resources.
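
To give a sense of scale for the "~2% of parameters" figure, the following is a rough back-of-the-envelope sketch, not the authors' configuration. It counts LoRA adapter weights for the publicly documented Llama 3.1 8B layer shapes. The adapter rank and the choice to adapt every attention and MLP projection are assumptions made for illustration; the abstract does not state them.

```python
# Illustrative arithmetic only. The LoRA rank and the set of adapted
# projections are assumptions, not values reported in the abstract.

# Publicly documented Llama 3.1 8B shapes.
HIDDEN = 4096            # model (embedding) dimension
INTERMEDIATE = 14336     # MLP inner dimension
KV_DIM = 1024            # 8 key/value heads x 128 head dimension
NUM_LAYERS = 32
TOTAL_PARAMS = 8.03e9    # approximate total parameter count

# (in_features, out_features) of each linear projection in one decoder layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(rank, targets=PROJECTIONS):
    """Count adapter weights: each adapted d_in x d_out matrix gains
    two low-rank factors of sizes rank x d_in and d_out x rank."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in targets.values())
    return per_layer * NUM_LAYERS


def trainable_fraction(rank):
    """Adapter weights as a fraction of the (approximate) base model size."""
    return lora_trainable_params(rank) / TOTAL_PARAMS


if __name__ == "__main__":
    for r in (8, 16, 64):
        n = lora_trainable_params(r)
        print(f"rank={r:3d}  adapter params={n:,}  fraction={trainable_fraction(r):.2%}")
```

Under these assumptions, a rank of 64 gives 167,772,160 adapter weights, or roughly 2.1% of the base model, which is in the same range as the figure quoted above. Other combinations of rank and target modules could yield a similar fraction, so this should not be read as the paper's actual setup.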

Paper Structure (66 sections, 11 equations, 14 figures, 10 tables, 2 algorithms)

Introduction
Data and Preprocessing
Category Imbalance
Cross-Disaster Variability
Split Consistency
Implications for Research
Large Language Models for Humanitarian Classification
Transformer self-attention.
Autoregressive language modeling and in-context evaluation
Backbone choice
LLaMA 3.1 8B Architecture
Suitability for Humanitarian Classification
Method
Baseline Prompting Strategies
Task Formulation
...and 51 more sections

Figures (14)

Figure 1: HumAID dataset preprocessing pipeline transforming raw TSV files into unified JSONL format with 76,484 samples.
Figure 2: HumAID dataset distribution across humanitarian categories and disaster types. Pie sectors represent train/dev/test splits (green/blue/red); color intensity indicates log-scaled counts; pie size reflects total samples.
Figure 3: t-SNE visualization of embedding spaces comparing (a) pre-trained and (b) fine-tuned models. Colors indicate humanitarian categories, marker shapes distinguish data splits (circle: train, square: dev, triangle: test). Clustering metrics are shown in the upper-left corner of each subplot.
Figure 4: Humanitarian label classification performance by disaster type and category. Each cell shows four triangles (Zero-shot: bottom, Manual: right, Static: top, Dynamic: left). Deeper red indicates higher scores. The N/A cell indicates no test samples exist for that combination (earthquake × missing_or_found_people).
Figure 5: Event type classification performance across four disaster categories. Dynamic Few-shot consistently achieves the highest scores across all event types.
...and 9 more figures