RoadBench: A Vision-Language Foundation Model and Benchmark for Road Damage Understanding

Xi Xiao, Yunbei Zhang, Janet Wang, Lin Zhao, Yuxiang Wei, Hengjia Li, Yanshu Li, Xinyuan Song, Xiao Wang, Swalpa Kumar Roy, Hao Xu, Tianyang Wang

TL;DR

RoadBench addresses the lack of multimodal context in road damage understanding by pairing high-resolution road images with textual descriptions and introducing RoadCLIP, a vision-language model tailored to road damage. The model integrates Disease-aware Positional Encoding (DaPE) and Domain-Specific Prior Injection to align image regions with damage semantics, and it is trained on image-text pairs produced by a GPT-driven data generation pipeline. Experiments show that RoadCLIP achieves state-of-the-art performance on road damage recognition and cross-modal retrieval, significantly surpassing vision-only baselines. This work provides a large-scale, domain-specific benchmark and a foundation for robust multimodal infrastructure monitoring.

Abstract

Accurate road damage detection is crucial for timely infrastructure maintenance and public safety, but existing vision-only datasets and models lack the rich contextual understanding that textual information can provide. To address this limitation, we introduce RoadBench, the first multimodal benchmark for comprehensive road damage understanding. This dataset pairs high-resolution images of road damage with detailed textual descriptions, providing richer context for model training. We also present RoadCLIP, a novel vision-language model that builds upon CLIP by integrating domain-specific enhancements. It includes a disease-aware positional encoding that captures spatial patterns of road defects and a mechanism for injecting road-condition priors to refine the model's understanding of road damage. We further employ a GPT-driven data generation pipeline to expand the image-text pairs in RoadBench, greatly increasing data diversity without exhaustive manual annotation. Experiments demonstrate that RoadCLIP achieves state-of-the-art performance on road damage recognition tasks, significantly outperforming existing vision-only models by 19.2%. These results highlight the advantages of integrating visual and textual information for enhanced road condition analysis, setting new benchmarks for the field and paving the way for more effective infrastructure monitoring through multimodal learning.

Paper Structure

This paper contains 19 sections, 7 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Overview of the RoadBench benchmark and the RoadCLIP framework. Left: Sample image–text pairs synthesized in diverse road scenarios, capturing damage types (e.g., longitudinal cracks, potholes), weather conditions, spatial context, and surface materials. Right: Our RoadCLIP architecture leverages a dual-encoder backbone enhanced with Disease-aware Positional Encoding and Road Disease Prior Injection to align visual and textual features in a multimodal embedding space.
  • Figure 2: Overview of the RoadBench construction pipeline. Structured prompts describing road damage types and environments guide multimodal generation with GPT-4o. Human experts verify the generated image–text pairs, which are then annotated and compiled into a high-quality benchmark dataset with images, captions, and labels.
  • Figure 3: Category-wise proportion of road defect types in RoadBench.
  • Figure 4: Overall architecture of RoadCLIP. The model uses a dual-encoder CLIP-based architecture, projecting road images and damage descriptions into a shared space, trained using a symmetric contrastive loss. A Disease-aware Positional Encoding (DaPE) module adds spatial priors to the visual encoder, while a Domain-Specific Prior Injection module enriches both modalities.
  • Figure 5: Illustration of the text-region alignment process in RoadCLIP. RoadCLIP encodes text and image, and computes token-wise similarity between text and visual patches. This produces a cross-modal attention map that highlights semantically aligned regions in the image.
  • ...and 1 more figures
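
The captions for Figures 4 and 5 mention a symmetric contrastive loss over a shared embedding space and a token-wise similarity between text tokens and visual patches. The sketch below illustrates both ideas in the generic CLIP style; it is not the RoadCLIP implementation. The function names, the temperature value of 0.07, and the max-over-tokens aggregation used to build the map are assumptions made for illustration, and neither DaPE nor prior injection is modeled.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Generic CLIP-style loss for a batch of B matched image/text embeddings.

    image_emb, text_emb: arrays of shape (B, D); row i of each is a matched pair.
    temperature: assumed value, not taken from the paper.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature            # (B, B) similarity matrix
    diag = np.arange(logits.shape[0])
    # Image-to-text: each row should put its mass on the matching column.
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Text-to-image: each column should put its mass on the matching row.
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)


def patch_token_similarity_map(patch_emb, token_emb, grid_shape):
    """Cosine similarity between every text token and every image patch.

    patch_emb: (P, D) patch embeddings; token_emb: (T, D) token embeddings.
    grid_shape: (rows, cols) with rows * cols == P.
    Aggregating with a max over tokens is an assumption of this sketch.
    """
    patches = l2_normalize(patch_emb)
    tokens = l2_normalize(token_emb)
    sim = tokens @ patches.T                      # (T, P) token-wise similarities
    per_patch = sim.max(axis=0)                   # best-matching token per patch
    return per_patch.reshape(grid_shape)
```

With matched pairs on the diagonal of the similarity matrix, the loss is close to zero; shuffling the text rows so that pairs no longer line up increases it. The similarity map can be upsampled to the image size to visualize which regions align with the description.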