CyPortQA: Benchmarking Multimodal Large Language Models for Cyclone Preparedness in Port Operation
Chenchen Kuai, Chenhao Wu, Yang Zhou, Xiubin Bruce Wang, Tianbao Yang, Zhengzhong Tu, Zihao Li, Yunlong Zhang
TL;DR
CyPortQA addresses the need for reliable, multimodal decision support for cyclone preparedness in port operations by compiling a nine-year, multi-port real-world dataset that links NOAA forecast products, port performance, and USCG bulletins. The authors build a scenario-encoding and automated QA-generation pipeline to produce 117,178 QA pairs across three task categories: situation understanding, impact estimation, and decision reasoning. Benchmarking seven MLLMs (open-source and proprietary), they find strong situational awareness but notable gaps in quantitative impact estimation and actionable decision support, with proprietary models generally outperforming open-source baselines. The work provides a publicly available benchmark and protocol to advance research on robust, LLM-assisted critical-infrastructure resilience.
Abstract
As tropical cyclones intensify and track forecasts become increasingly uncertain, U.S. ports face heightened supply-chain risk under extreme weather conditions. Port operators need to rapidly synthesize diverse multimodal forecast products, such as probabilistic wind maps, track cones, and official advisories, into clear, actionable guidance as cyclones approach. Multimodal large language models (MLLMs) offer a powerful means to integrate these heterogeneous data sources alongside broader contextual knowledge, yet their accuracy and reliability in the specific context of port cyclone preparedness have not been rigorously evaluated. To fill this gap, we introduce CyPortQA, the first multimodal benchmark tailored to port operations under cyclone threat. CyPortQA assembles 2,917 real-world disruption scenarios from 2015 through 2023, spanning 145 U.S. principal ports and 90 named storms. Each scenario fuses multisource data (i.e., tropical cyclone products, port operational impact records, and port condition bulletins) and is expanded through an automated pipeline into 117,178 structured question-answer pairs. Using this benchmark, we conduct extensive experiments on diverse MLLMs, including both open-source and proprietary models. MLLMs demonstrate great potential in situation understanding but still face considerable challenges in reasoning tasks, including potential impact estimation and decision reasoning.
