
From Prediction to Diagnosis: Reasoning-Aware AI for Photovoltaic Defect Inspection

Dev Mistry, Feng Qiu, Bo Chen, Feng Liu, Can Chen, Mohammad Shahidehpour, Ren Wang

Abstract

Reliable photovoltaic defect identification is essential for maintaining energy yield, ensuring warranty compliance, and enabling scalable inspection of rapidly expanding solar fleets. Although recent advances in computer vision have improved automated defect detection, most existing systems operate as opaque classifiers that provide limited diagnostic insight for high-stakes energy infrastructure. Here we introduce REVL-PV, a vision-language framework that embeds domain-specific diagnostic reasoning into multimodal learning across electroluminescence, thermal, and visible-light imagery. By requiring the model to link visual evidence to plausible defect mechanisms before classification, the framework produces structured diagnostic reports aligned with professional photovoltaic inspection practice. Evaluated on 1,927 real-world modules spanning eight defect categories, REVL-PV achieves 93% classification accuracy while producing interpretable diagnostic rationales and maintaining strong robustness under realistic image corruptions. A blind concordance study with a certified solar inspection expert shows strong semantic alignment between model explanations and expert assessments across defect identification, root-cause attribution, and visual descriptions. These results demonstrate that reasoning-aware multimodal learning establishes a general paradigm for trustworthy AI-assisted inspection of photovoltaic energy infrastructure.


Paper Structure

This paper contains 1 section, 5 equations, 5 figures.


Figures (5)

  • Figure 1: REVL-PV model overview. Stage 1: Class-Balanced Multimodal Data Curation. Photovoltaic inspection data are inherently heterogeneous, with defect categories differing in visual distinctiveness and imaging modalities (EL, thermal, visible-light) varying in diagnostic quality. To address this, inspection images from multiple public datasets are unified, preprocessed, and curated using class balancing and physically motivated augmentation to ensure consistent representation of rare and subtle defects across modalities. Stage 2: Reasoning-Boosted Supervised Fine-Tuning (RSFT). The vision-language backbone is fine-tuned on generated, reasoning-dense samples to embed diagnostic logic via <think> and <answer> tokens. Stage 3: Two-Phase Reasoning Enhancement (2PRE). We first establish text-only Evidence → Cause → Action coherence, then ground this logic visually using rule-based reinforcement learning. Stage 4: Robust Inference. Test-Time Augmentation (TTA) aggregates multi-crop predictions to finalize the defect diagnosis and its underlying rationale.
  • Figure 2: Quantitative evaluation of the REVL-PV framework. a. Baseline Model Comparison. The proposed method achieves a peak accuracy of 93%, outperforming both general vision models and specialized object detectors such as YOLOv11x. b. Ablation of Training Stages. Step-by-step performance gains illustrate the compounding value of the reasoning pipeline. While standard supervised fine-tuning (SFT) reaches only 72.0%, the transition to RSFT and the subsequent 2PRE phases systematically refines the model's reasoning capabilities, steadily driving accuracy to 93.0%. c. Per-Class F1 Scores. REVL-PV exhibits a consistently superior performance envelope across all defect categories. d. Risk-Coverage Curve. In the operationally critical high-coverage regime (≥50%, shaded) most relevant to automated inspection deployment, REVL-PV avoids the escalating overconfidence of YOLOv11x, maintaining a highly stable risk profile and a superior partial AURC50–100.
  • Figure 3: Model robustness and predictive reliability under physical signal degradation. a. Robustness under Noise Degradation. While the baseline YOLOv11x fails catastrophically at minimal noise levels (dropping from 86% to 43% accuracy at Severity 1, σ = 0.01), the reasoning-aware REVL-PV framework degrades gracefully. Even under the most severe visual corruption (Severity 5, σ = 0.15), REVL-PV retains a strong 63% top-1 accuracy, outperforming YOLOv11x by 24 percentage points. b. Risk-Coverage Curve at Noise Severity 1. In the operationally relevant high-coverage regime (≥50%), REVL-PV maintains stable uncertainty calibration, holding a consistently low risk profile (e.g., a 13.6% error rate at 90% coverage). In contrast, YOLOv11x exhibits severe overconfidence and escalating risk, with its error rate surging to 53.1% at 90% coverage. This stability is quantified by the partial area under the risk-coverage curve (AURC50–100), where REVL-PV achieves 0.0682 compared to YOLOv11x's 0.2330.
  • Figure 4: Qualitative comparison of diagnostic outputs. Representative predictions from GPT-5 (zero-shot), Gemini 2.5 Flash (fine-tuned), a certified solar panel expert, and REVL-PV across three test cases. General-purpose models such as GPT-5 hallucinated in Case 3, misclassifying a "Clean Panel" as "snail-trail/microcrack", a false positive. Similarly, Gemini 2.5 Flash, despite being fine-tuned, exhibits critical failures, including fabricated visual features and arithmetically inconsistent confidence scores. To establish a ground-truth baseline, the expert evaluated the images under fully blind conditions using a structured assessment that mirrored the model's unified eight-class defect taxonomy and output format. REVL-PV produces fully structured outputs including defect type, calibrated probability distributions, root cause, and recommended action, with no hallucinations across all cases.
  • Figure 5: Semantic alignment of generated diagnostic reports. The charts display BERTScore metrics (Precision, Recall, F1) measuring the linguistic similarity between the model's textual output and the human expert's ground truth descriptions across four diagnostic components. a–b. Classification Alignment. The near-perfect scores in Primary Classification (F1: 98.4%) and Alternative Classification (F1: 92.9%) indicate that the model utilizes the exact industry-standard terminology required for formal reporting. c. Causal Reasoning. The high Recall (90.2%) in the Root Cause analysis demonstrates that the model successfully retrieves the vast majority of underlying factors identified by experts (e.g., thermal stress, manufacturing defects) while ensuring no critical warnings are omitted. d. Descriptive Accuracy. The strong performance in Visual Description (F1: 86.9%) confirms the model's ability to articulate complex visual traits with a level of detail comparable to expert analysis.
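The Test-Time Augmentation step in Figure 1 (Stage 4) can be sketched as follows. The paper does not specify its aggregation rule, so this minimal sketch assumes a common choice: average the softmax probabilities over the crops and take the argmax. The helper name `tta_aggregate` is hypothetical.

```python
import numpy as np

def tta_aggregate(crop_logits):
    """Aggregate per-crop logits into a single class distribution.

    crop_logits: array-like of shape (n_crops, n_classes).
    Assumption: averaging softmax probabilities over crops, one common
    TTA scheme; the paper's exact aggregation rule is not specified.
    """
    logits = np.asarray(crop_logits, dtype=float)
    # Softmax each crop's logits (subtract the row max for stability).
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    mean_probs = probs.mean(axis=0)  # average the distributions over crops
    return mean_probs, int(mean_probs.argmax())
```

Averaging probabilities (rather than taking a majority vote) keeps the aggregated output usable as a calibrated distribution, which the risk-coverage analysis in Figures 2 and 3 relies on.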
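The risk-coverage curves and partial AURC50–100 reported in Figures 2d and 3b follow a standard construction: sort predictions by confidence, grow coverage one sample at a time, and track the running error rate (risk). A minimal sketch, with `risk_coverage` and `partial_aurc` as illustrative helper names:

```python
import numpy as np

def risk_coverage(confidences, correct):
    """Risk-coverage curve: rank samples by confidence (descending) and
    report the running error rate as coverage grows one sample at a time."""
    order = np.argsort(-np.asarray(confidences, dtype=float))
    errors = 1.0 - np.asarray(correct, dtype=float)[order]
    n = len(errors)
    coverage = np.arange(1, n + 1) / n
    risk = np.cumsum(errors) / np.arange(1, n + 1)
    return coverage, risk

def partial_aurc(coverage, risk, lo=0.5, hi=1.0):
    """Area under the risk-coverage curve restricted to [lo, hi],
    normalized by the interval width (trapezoidal rule)."""
    mask = (coverage >= lo) & (coverage <= hi)
    x, y = coverage[mask], risk[mask]
    area = np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(x))
    return area / (hi - lo)
```

Restricting the area to coverage ≥50% focuses the metric on the high-coverage regime the captions call operationally relevant, where a selective classifier must answer most queries rather than abstain.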
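Figure 5's precision/recall/F1 decomposition can be made concrete with a crude lexical stand-in. BERTScore itself matches contextual embeddings between candidate and reference tokens; the simplified token-overlap version below (a hypothetical `token_f1` helper, not the paper's metric) only illustrates how precision penalizes fabricated content while recall penalizes omitted expert findings:

```python
from collections import Counter

def token_f1(candidate, reference):
    """Token-overlap precision/recall/F1 between a generated report and an
    expert reference. A lexical stand-in for BERTScore (which matches
    contextual embeddings), shown only to make the P/R/F1 split concrete."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection size
    if overlap == 0:
        return 0.0, 0.0, 0.0
    p = overlap / sum(cand.values())      # low if the report adds extra claims
    r = overlap / sum(ref.values())       # low if expert findings are missed
    return p, r, 2 * p * r / (p + r)
```

Under this reading, the high Root Cause recall in Figure 5c means few expert-identified factors go unmentioned, while high precision in the classification panels means the model sticks to standard terminology rather than inventing terms.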