Self-Supervised Visual Prompting for Cross-Domain Road Damage Detection

Xi Xiao, Zhuxuanzi Wang, Mingqiao Mo, Chen Liu, Chenrui Ma, Yanshu Li, Smita Krishnaswamy, Xiao Wang, Tianyang Wang

TL;DR

PROBE tackles cross-domain road damage detection by combining self-supervised prompting with domain-aware alignment. It introduces SPEM, which derives target-specific visual prompts from unlabeled target data by clustering target patch embeddings after dimensionality reduction, then injects these prompts into a frozen Vision Transformer to bias its representations toward defects. Complementarily, DAPA aligns prompt-conditioned source and target features using a lightweight linear-kernel MMD objective, enabling robust cross-domain transfer without heavy backbone fine-tuning. Across four challenging benchmarks, PROBE achieves state-of-the-art performance in zero-shot and few-shot settings, demonstrating strong cross-domain robustness, data efficiency, and practical parameter efficiency. The approach highlights prompting as a scalable mechanism for self-supervised, domain-adaptive vision systems in safety-critical infrastructure inspection.
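To make the alignment term concrete: with a linear kernel, the (biased) squared MMD between two feature sets reduces to the squared Euclidean distance between their mean vectors. The sketch below illustrates that identity only. The function name `linear_mmd2` and the toy arrays are assumptions for illustration, and the actual DAPA objective may include weighting or normalization not described here.

```python
import numpy as np


def linear_mmd2(source_feats: np.ndarray, target_feats: np.ndarray) -> float:
    """Biased squared MMD with a linear kernel k(x, y) = <x, y>.

    For this kernel the estimate equals ||mean(source) - mean(target)||^2.
    Both inputs are (num_samples, feature_dim) arrays.
    `linear_mmd2` is a hypothetical name, not taken from the PROBE code.
    """
    diff = source_feats.mean(axis=0) - target_feats.mean(axis=0)
    return float(diff @ diff)


# Toy usage: the target features are the source features shifted by (3, 4),
# so the squared distance between the means is 3^2 + 4^2 = 25.
source = np.array([[0.0, 0.0], [2.0, 2.0]])
target = source + np.array([3.0, 4.0])
print(linear_mmd2(source, source))  # 0.0
print(linear_mmd2(source, target))  # 25.0
```

Because the estimate depends only on the two feature means, its cost is linear in the number of samples, which is consistent with the "lightweight" description above.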

Abstract

The deployment of automated pavement defect detection is often hindered by poor cross-domain generalization. Supervised detectors achieve strong in-domain accuracy but require costly re-annotation for new environments, while standard self-supervised methods capture generic features and remain vulnerable to domain shift. We propose PROBE, a self-supervised framework that visually probes target domains without labels. PROBE introduces a Self-supervised Prompt Enhancement Module (SPEM), which derives defect-aware prompts from unlabeled target data to guide a frozen ViT backbone, and a Domain-Aware Prompt Alignment (DAPA) objective, which aligns prompt-conditioned source and target representations. Experiments on four challenging benchmarks show that PROBE consistently outperforms strong supervised, self-supervised, and adaptation baselines, achieving robust zero-shot transfer, improved resilience to domain variations, and high data efficiency in few-shot adaptation. These results highlight self-supervised prompting as a practical direction for building scalable and adaptive visual inspection systems. Source code is publicly available: https://github.com/xixiaouab/PROBE/tree/main

Paper Structure

This paper contains 83 sections, 17 equations, 11 figures, 6 tables, 1 algorithm.

Figures (11)

  • Figure 1: Overview of PROBE. The framework targets cross-domain road damage detection with a frozen ViT-B/16 backbone and two key modules: (i) SPEM (bottom-right) converts unlabeled target-domain patch embeddings into visual prototypes via PCA + K-means and projects them with a shallow MLP into prompt tokens, which are injected at shallow and mid transformer layers to emphasize defect-relevant semantics; (ii) DAPA (top-right) aligns prompt-enhanced source/target features using a linear-kernel MMD in the prompt-conditioned space. A lightweight detection head (right) with two conv blocks and a $1{\times}1$ prediction layer is trained on a small labeled source subset (few-shot target labels optional).
  • Figure 2: Impact of the number of prompts ($K$) on cross-domain mAP (%). Performance peaks at $K=10$, after which it slightly declines. Results are averaged over three runs.
  • Figure 3: Impact of injection depth on cross-domain mAP (%). A multi-stage strategy (Shallow+Mid) consistently outperforms single-layer injection.
  • Figure 4: Detection results of different state-of-the-art methods across multiple domains (Snow, Desert, Forest, and City). Our method achieves more robust detection under diverse environments compared with CDTrans and MGD-YOLO.
  • Figure 5: Heatmap++ visualization of focus regions across domains. We visualize the core focus regions of different methods under four domains (Snow, Desert, Forest, City). Heatmap++ highlights where models attend when predicting defects. Compared to CDTrans and MGD-YOLO, our method concentrates more precisely on defect areas (cracks/potholes) and suppresses background textures, showing superior cross-domain localization.
  • ...and 6 more figures
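
The prototype step described in the Figure 1 caption (PCA followed by K-means over unlabeled target patch embeddings) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the reduced dimension of 32, the random stand-in embeddings, and the plain Lloyd-style K-means are all hypothetical choices. The shallow MLP that maps prototypes to prompt tokens is omitted. The default of 10 prototypes follows the Figure 2 caption, and 768 is the ViT-B/16 embedding width.

```python
import numpy as np


def pca_reduce(x: np.ndarray, n_components: int) -> np.ndarray:
    """Project rows of x onto the top principal components (via SVD)."""
    centered = x - x.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def assign(x: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Index of the nearest centroid (squared Euclidean) for each row."""
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def kmeans(x: np.ndarray, k: int, n_iters: int = 50, seed: int = 0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iters):
        labels = assign(x, centroids)
        updated = centroids.copy()
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                updated[j] = members.mean(axis=0)
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return centroids, assign(x, centroids)


def target_prototypes(patch_embeddings, n_prompts=10, n_components=32):
    """PCA + K-means over patch embeddings; centroids act as prototypes."""
    reduced = pca_reduce(patch_embeddings, n_components)
    return kmeans(reduced, n_prompts)


# Toy usage with random stand-ins for target-domain patch embeddings.
embeddings = np.random.default_rng(0).normal(size=(500, 768))
prototypes, labels = target_prototypes(embeddings)
print(prototypes.shape)  # (10, 32)
```

In this sketch each prototype is a cluster centroid in the PCA-reduced space. A learned projection would still be needed to turn the prototypes into tokens of the backbone's width before injection.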