CyPortQA: Benchmarking Multimodal Large Language Models for Cyclone Preparedness in Port Operation

Chenchen Kuai, Chenhao Wu, Yang Zhou, Xiubin Bruce Wang, Tianbao Yang, Zhengzhong Tu, Zihao Li, Yunlong Zhang

TL;DR

CyPortQA addresses the need for reliable, multimodal decision support for cyclone preparedness in port operations by compiling a nine-year, multi-port real-world dataset that links NOAA forecast products, port performance, and USCG bulletins. The authors build a scenario-encoding and automated QA-generation pipeline to produce 117,178 QA pairs across three task categories: situation understanding, impact estimation, and decision reasoning. Benchmarking seven MLLMs (open-source and proprietary), they find strong situational awareness but notable gaps in quantitative impact estimation and actionable decision support, with proprietary models generally outperforming open-source baselines. The work provides a publicly available benchmark and protocol to advance research on robust, LLM-assisted critical-infrastructure resilience.

Abstract

As tropical cyclones intensify and track forecasts become increasingly uncertain, U.S. ports face heightened supply-chain risk under extreme weather conditions. Port operators need to rapidly synthesize diverse multimodal forecast products, such as probabilistic wind maps, track cones, and official advisories, into clear, actionable guidance as cyclones approach. Multimodal large language models (MLLMs) offer a powerful means to integrate these heterogeneous data sources alongside broader contextual knowledge, yet their accuracy and reliability in the specific context of port cyclone preparedness have not been rigorously evaluated. To fill this gap, we introduce CyPortQA, the first multimodal benchmark tailored to port operations under cyclone threat. CyPortQA assembles 2,917 real-world disruption scenarios from 2015 through 2023, spanning 145 U.S. principal ports and 90 named storms. Each scenario fuses multisource data (i.e., tropical cyclone products, port operational impact records, and port condition bulletins) and is expanded through an automated pipeline into 117,178 structured question-answer pairs. Using this benchmark, we conduct extensive experiments on diverse MLLMs, including both open-source and proprietary models. MLLMs demonstrate great potential in situation understanding but still face considerable challenges in reasoning tasks, including potential impact estimation and decision reasoning.

Paper Structure

This paper contains 24 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Port Preparedness Framework under Tropical Cyclones and the CyPortQA Benchmark. (a) The time-evolving port preparedness in response to TC, highlighting dynamic decision and key preparedness tasks. (b) Scenario-based QA construction pipeline in CyPortQA, sourcing NOAA TC products, operational performance data and port condition bulletins. (c) Representative CyPortQA examples across three preparedness tasks: S1 – Situation Understanding, S2 – Impact Estimation, and S3 – Decision Reasoning, each with corresponding multimodal inputs and question formats.
  • Figure 2: Demonstration of NOAA-released tropical cyclone weather products, with example data from Harvey (2017). The data is organized every 12 hours for port operation analysis.
  • Figure 3: Spatial Awareness & Exposure Interpretation Gaps. A tick indicates a 'yes' response (the port lies within the cyclone’s uncertainty cone), while a cross indicates a 'no' response (the port lies outside).
  • Figure 4: Temporal-understanding gaps. Performance aggregated at 72, 48, 24, and 12 h before landfall. Left panel: tolerance-based accuracy; right panel: mean deviation from ground truth with 95% confidence intervals.
  • Figure 5: Responses from MLLMs for decision reasoning tasks (port and facility operation instructions) under a single scenario. Evaluation results from the LLM-as-a-judge include a numerical score and classify each response as either an under-reaction, normal reaction, or over-reaction.