VLCE: A Knowledge-Enhanced Framework for Image Description in Disaster Assessment
Md. Mahfuzur Rahman, Kishor Datta Gupta, Marufa Kamal, Fahad Rahman, Sunzida Siddique, Ahmed Rafi Hasan, Mohd Ariful Haque, Roy George
TL;DR
VLCE addresses the need for domain-specific disaster image captions by integrating external semantic knowledge from ConceptNet and WordNet into vision-language captioning. It employs two architectures, a CNN-LSTM and a ViT-based Transformer, paired with knowledge-graph (KG) enhanced embeddings and cross-modal attention to produce context-rich, actionable descriptions of satellite and UAV imagery. The approach uses keyword extraction, lexical enrichment, and knowledge-informed embeddings to close the semantic gap between visual content and disaster-domain language. Evaluated on RescueNet and xBD with CLIPScore and InfoMetIC, it achieves strong informativeness (e.g., up to 95.33% InfoMetIC on UAV data) and competitive semantic alignment. The findings show that knowledge graphs consistently boost caption informativeness and reduce hallucinations, and that Transformer-based models are robust across datasets, making the framework suitable for real-time disaster assessment and decision support.
Abstract
AI-based classification and segmentation play a vital role in automating disaster assessment. However, contemporary vision-language models (VLMs) produce descriptions that are poorly aligned with the objectives of disaster assessment, primarily because they lack domain knowledge and a sufficiently refined descriptive process. This research presents the Vision Language Caption Enhancer (VLCE), a dedicated multimodal framework that integrates external semantic knowledge from ConceptNet and WordNet to improve captioning. The objective is to produce disaster-specific descriptions that convert raw visual data into actionable intelligence. VLCE uses two separate architectures: a CNN-LSTM model with a ResNet50 backbone pretrained on EuroSAT, used for satellite imagery (xBD dataset), and a Vision Transformer developed for UAV imagery (RescueNet dataset). Across architectures and datasets, VLCE shows a consistent advantage over baseline models such as LLaVA and QwenVL. Our best configuration reaches 95.33% on InfoMetIC for UAV imagery while also performing strongly on satellite imagery. The proposed framework marks a shift from basic visual classification to the generation of comprehensive situational intelligence, and shows immediate applicability to real-time disaster assessment systems.
