Synthetic Defect Image Generation for Power Line Insulator Inspection Using Multimodal Large Language Models

Xuesong Wang, Caisheng Wang

TL;DR

An off-the-shelf multimodal large language model (MLLM) serves as a training-free image generator that synthesizes defect images from visual references and text prompts, suggesting a practical, low-barrier path for improving defect recognition when collecting additional real defects is slow or infeasible.

Abstract

Utility companies increasingly rely on drone imagery for post-event and routine inspection, but training accurate defect-type classifiers remains difficult because defect examples are rare and inspection datasets are often limited or proprietary. We address this data-scarcity setting by using an off-the-shelf multimodal large language model (MLLM) as a training-free image generator to synthesize defect images from visual references and text prompts. Our pipeline increases diversity via dual-reference conditioning, improves label fidelity with lightweight human verification and prompt refinement, and filters the resulting synthetic pool with an embedding-based selection rule that uses distances to class centroids computed from the real training split. We evaluate on ceramic insulator defect-type classification (shell vs. glaze) using a public dataset with a realistic low training-data regime (104 real training images; 152 validation; 308 test). Augmenting the 10% real training set with embedding-selected synthetic images improves the test F1 score (harmonic mean of precision and recall) from 0.615 to 0.739 (a 20% relative gain), corresponding to an estimated 4-5x data-efficiency gain, and the gains persist with stronger backbone models and frozen-feature linear-probe baselines. These results suggest a practical, low-barrier path for improving defect recognition when collecting additional real defects is slow or infeasible.
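
The abstract does not spell out the exact selection rule, so the following is only a minimal sketch of one plausible centroid-based filter. It assumes image embeddings are already available as NumPy arrays, and it keeps a synthetic image only when the nearest real-class centroid matches the label the image was generated for. The function name `select_by_centroid`, the Euclidean distance, and the nearest-centroid criterion are all illustrative assumptions, not the authors' stated method.

```python
import numpy as np

def select_by_centroid(real_emb, real_labels, syn_emb, syn_labels):
    """Return a boolean mask over synthetic samples (hypothetical rule).

    A synthetic sample is kept when its nearest class centroid, computed
    from the real training embeddings, agrees with its intended label.
    """
    real_emb = np.asarray(real_emb, dtype=float)
    syn_emb = np.asarray(syn_emb, dtype=float)
    real_labels = np.asarray(real_labels)
    syn_labels = np.asarray(syn_labels)

    classes = np.unique(real_labels)
    # One centroid per class, from the real training split only.
    centroids = np.stack(
        [real_emb[real_labels == c].mean(axis=0) for c in classes]
    )
    # Pairwise Euclidean distances, shape (n_synthetic, n_classes).
    dists = np.linalg.norm(
        syn_emb[:, None, :] - centroids[None, :, :], axis=2
    )
    nearest = classes[dists.argmin(axis=1)]
    return nearest == syn_labels

# Toy 2-D embeddings: class 0 near the origin, class 1 near (10, 11).
real_emb = [[0, 0], [0, 2], [10, 10], [10, 12]]
real_labels = [0, 0, 1, 1]
syn_emb = [[1, 1], [9, 9], [0, 0]]
syn_labels = [0, 0, 1]  # the last two sit near the wrong centroid
mask = select_by_centroid(real_emb, real_labels, syn_emb, syn_labels)
```

In the toy data only the first synthetic sample passes the filter. A thresholded or top-k variant on the distance to the intended class centroid would be an equally reasonable reading of the abstract.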

Paper Structure (31 sections, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Pseudocode for the proposed iterative generation, verification, and prompt-refinement loop.
  • Figure 2: Dual-reference generation demonstration.
  • Figure 3: Human-in-the-loop verification interface.
  • Figure 4: Baseline vs. RandAugment across training fractions. RandAugment shows inconsistent behavior and does not reliably improve defect-type recognition in the data-scarce regime.
  • Figure 5: Test F1 performance by synthetic batch for dual-reference generations. V1 prompt batches (0-2) shown in purple, V2 prompt batches (3-7) shown in orange. Dashed line indicates 10% baseline (0.615).
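
The 0.615 baseline marked in Figure 5 and the 0.739 augmented score reported in the abstract can be related with a short calculation. The sketch below uses the standard F1 definition and a simple relative-gain formula; the helper names `f1_score` and `relative_gain` are illustrative and are not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def relative_gain(baseline: float, improved: float) -> float:
    """Improvement expressed as a fraction of the baseline score."""
    return (improved - baseline) / baseline

# Scores reported in the abstract: 10% real-data baseline vs. augmented.
gain = relative_gain(0.615, 0.739)
print(f"{gain:.1%}")  # prints 20.2%
```

This is consistent with the roughly 20% relative improvement quoted in the abstract.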