A Lightweight LLM Framework for Disaster Humanitarian Information Classification
Han Jinzhen, Kim Jisung, Yang Jong Soo, Yun Hong Sik
TL;DR
This work presents a lightweight LLM-based framework for dual-task disaster information classification on HumAID, leveraging Llama 3.1 8B with prompting, LoRA, and RAG to balance accuracy with resource constraints. LoRA fine-tuning yields strong improvements, achieving 79.62% humanitarian accuracy and 98.79% event-type accuracy, while QLoRA maintains about 99% of LoRA performance at half the memory. Counterintuitively, RAG degrades performance for fine-tuned models due to label noise in retrieved examples, and GPT-4 analyses reveal intrinsic ambiguity in several humanitarian categories that limits achievable accuracy. The results advocate prioritizing parameter-efficient fine-tuning over retrieval augmentation for well-defined, data-rich crisis tasks, and highlight taxonomy refinement as a practical pathway to improved reliability. The proposed, reproducible pipeline enables scalable crisis intelligence with modest computational resources, supporting timely and actionable disaster response.
Abstract
Timely classification of humanitarian information from social media is critical for effective disaster response. However, deploying large language models (LLMs) for this task faces challenges in resource-constrained emergency settings. This paper develops a lightweight, cost-effective framework for disaster tweet classification using parameter-efficient fine-tuning. We construct a unified experimental corpus by integrating and normalizing the HumAID dataset (76,484 tweets across 19 disaster events) into a dual-task benchmark: humanitarian information categorization and event type identification. Through systematic evaluation of prompting strategies, LoRA fine-tuning, and retrieval-augmented generation (RAG) on Llama 3.1 8B, we demonstrate that: (1) LoRA achieves 79.62% humanitarian classification accuracy (+37.79% over zero-shot) while training only ~2% of parameters; (2) QLoRA enables efficient deployment with 99.4% of LoRA performance at 50% memory cost; (3) contrary to common assumptions, RAG strategies degrade fine-tuned model performance due to label noise from retrieved examples. These findings establish a practical, reproducible pipeline for building reliable crisis intelligence systems with limited computational resources.
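To make the "small fraction of trainable parameters" idea behind LoRA concrete, the following is a minimal sketch of the parameter arithmetic, not the authors' configuration. LoRA replaces the update to a weight matrix of shape d_out x d_in with two low-rank factors of rank r, so the adapter holds r * (d_in + d_out) parameters. The layer shapes and rank below are hypothetical values chosen for illustration, and the fraction is computed relative to the adapted layers only, so it does not reproduce the ~2% figure reported above.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def trainable_fraction(layers: list[tuple[int, int]], rank: int) -> float:
    """Ratio of LoRA adapter parameters to the full parameters of the adapted layers."""
    full = sum(d_in * d_out for d_in, d_out in layers)
    lora = sum(lora_trainable_params(d_in, d_out, rank) for d_in, d_out in layers)
    return lora / full


# Hypothetical setup: four square 4096 x 4096 projections adapted at rank 16.
example_layers = [(4096, 4096)] * 4
print(lora_trainable_params(4096, 4096, 16))        # 131072
print(trainable_fraction(example_layers, rank=16))  # 0.0078125
```

The fraction grows linearly with the rank and depends on which layers are adapted, which is why the trainable share of a whole model varies between setups.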
