Cross-Domain Generalization of Multimodal LLMs for Global Photovoltaic Assessment

Muhao Guo, Yang Weng

TL;DR

The paper addresses the problem of undocumented distributed PV installations through a cross-domain generalization study of a multimodal LLM (PVAL) for global PV assessment. PVAL unifies detection, localization, and quantification through structured prompts and fine-tuning on satellite imagery across seven global regions, and its transferability is evaluated with the ΔF1 metric. Results show that PVAL degrades least in unseen regions compared with traditional CV and transformer baselines, highlighting the robustness of semantic reasoning over low-level texture cues for cross-domain PV mapping. The work demonstrates the potential of multimodal LLMs as scalable, transferable, and interpretable tools for global PV monitoring and grid planning.

Abstract

The rapid expansion of distributed photovoltaic (PV) systems poses challenges for power grid management, as many installations remain undocumented. While satellite imagery provides global coverage, traditional computer vision (CV) models such as CNNs and U-Nets require extensive labeled data and fail to generalize across regions. This study investigates the cross-domain generalization of a multimodal large language model (LLM) for global PV assessment. By leveraging structured prompts and fine-tuning, the model integrates detection, localization, and quantification within a unified schema. Cross-regional evaluation using the $\Delta$F1 metric demonstrates that the proposed model achieves the smallest performance degradation across unseen regions, outperforming conventional CV and transformer baselines. These results highlight the robustness of multimodal LLMs under domain shift and their potential for scalable, transferable, and interpretable global PV mapping.
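
The cross-regional metric mentioned above can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the sign convention here (unseen-region F1 minus fine-tuning-region F1, so that degradation is negative) is an assumption, as are the function names and the confusion counts, which are made up for illustration and are not results from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1 from confusion counts: 2*TP / (2*TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    # Return 0.0 when there are no positives at all, to avoid dividing by zero.
    return 2 * tp / denom if denom else 0.0


def delta_f1(f1_finetune_region: float, f1_unseen_region: float) -> float:
    """Assumed convention: F1 on the unseen region minus F1 on the fine-tuning region.

    A value near zero means the model transfers well; a large negative
    value means strong degradation under domain shift.
    """
    return f1_unseen_region - f1_finetune_region


# Hypothetical counts, for illustration only.
f1_source = f1_score(tp=90, fp=10, fn=10)  # region used for fine-tuning
f1_target = f1_score(tp=80, fp=20, fn=20)  # region never seen during fine-tuning
gap = delta_f1(f1_source, f1_target)
print(f"F1 source={f1_source:.2f}, target={f1_target:.2f}, delta={gap:+.2f}")
```

Under this convention, "smallest performance degradation" corresponds to the model whose ΔF1 is closest to zero across the unseen regions.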

Paper Structure

This paper contains 11 sections, 3 equations, 7 figures, 1 table, and 1 algorithm.

Figures (7)

  • Figure 1: Framework of the proposed multimodal LLM for global photovoltaic (PV) assessment. The pipeline includes data engineering, prompt engineering, and fine-tuning stages. Rooftop imagery from seven global regions is used to evaluate cross-domain generalization, with $\Delta$F1 representing performance changes between the fine-tuning and unseen regions.
  • Figure 2: Solar panel labeling schema and representative examples. The left panel shows nine spatial regions (top, bottom, left, right, center, and four diagonals), plus the “NA” case indicating no solar panels. The right panel displays annotated rooftop tiles with labels specifying (i) presence (True/False), (ii) location, and (iii) quantity range.
  • Figure 3: Comparison of model performance across Precision, Recall, F1-Score, and Accuracy metrics.
  • Figure 4: Comparison of F1-scores for solar and non-solar datasets across different models.
  • Figure 5: Accuracy comparison of solar panel location and quantity detection for fine-tuning and test datasets.
  • ...and 2 more figures
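
The labeling schema described in the Figure 2 caption (presence, location, quantity range) can be sketched as a small data structure. This is only an illustration under stated assumptions: the class name `PVLabel`, the names of the four diagonal regions, and the quantity strings are hypothetical, since the caption does not list the exact label values. The helper `field_accuracy` is likewise a hypothetical way to score location and quantity separately, in the spirit of the Figure 5 comparison.

```python
from dataclasses import dataclass

# Nine spatial regions from the Figure 2 caption. The diagonal names are an
# assumption; the caption only says "four diagonals".
LOCATIONS = (
    "top", "bottom", "left", "right", "center",
    "top-left", "top-right", "bottom-left", "bottom-right",
)
NA = "NA"  # used when no solar panels are present


@dataclass(frozen=True)
class PVLabel:
    present: bool   # (i) presence: True/False
    location: str   # (ii) one of LOCATIONS, or "NA"
    quantity: str   # (iii) quantity range as a string (bins are hypothetical), or "NA"

    def __post_init__(self) -> None:
        if self.present:
            if self.location not in LOCATIONS:
                raise ValueError(f"unknown location: {self.location!r}")
            if self.quantity == NA:
                raise ValueError("a positive tile needs a quantity range")
        elif self.location != NA or self.quantity != NA:
            raise ValueError("a negative tile must use 'NA' for location and quantity")


def field_accuracy(preds: list[PVLabel], golds: list[PVLabel], field: str) -> float:
    """Fraction of tiles where the given field matches the reference label."""
    if len(preds) != len(golds) or not golds:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(getattr(p, field) == getattr(g, field) for p, g in zip(preds, golds))
    return hits / len(golds)


# Hypothetical labels, for illustration only.
golds = [PVLabel(True, "top-left", "1-5"), PVLabel(False, NA, NA)]
preds = [PVLabel(True, "top-left", "6-10"), PVLabel(False, NA, NA)]
loc_acc = field_accuracy(preds, golds, "location")
qty_acc = field_accuracy(preds, golds, "quantity")
```

Validating the label at construction time keeps inconsistent combinations, such as a negative tile with a location, out of both the fine-tuning data and the evaluation.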