OpenConstruction: A Systematic Synthesis of Open Visual Datasets for Data-Centric Artificial Intelligence in Construction Monitoring

Ruoxin Xiong, Yanyu Wang, Jiannan Cai, Kaijian Liu, Yuansheng Zhu, Pingbo Tang, Nora El-Gohary

TL;DR

This work addresses fragmentation in open visual datasets for construction monitoring by conducting a systematic review of 51 publicly available datasets (2005–2024) and introducing the OpenConstruction catalog with a standardized data schema. It analyzes data fundamentals, modalities, annotations, and tasks, revealing a predominance of RGB imagery, bounding-box annotations, and limited multimodal or temporal data with heterogeneous licensing. The authors articulate four core gaps—sensing, semantics, context/interoperability, and governance—and propose a FAIR-aligned roadmap with four pillars (multimodal data acquisition, ontology consistency, contextual benchmarking, governance). The catalog and roadmap aim to enhance data discoverability, reproducibility, and practical adoption of data-centric AI in construction monitoring, with implications for safety, productivity, and digital-twin integration.

Abstract

The construction industry increasingly relies on visual data to support Artificial Intelligence (AI) and Machine Learning (ML) applications for site monitoring. High-quality, domain-specific datasets, comprising images, videos, and point clouds, capture site geometry and spatiotemporal dynamics, including the location and interaction of objects, workers, and materials. However, despite growing interest in leveraging visual datasets, existing resources vary widely in size, data modalities, annotation quality, and representativeness of real-world construction conditions. A systematic review that categorizes their data characteristics and application contexts is still lacking, which limits the community's ability to understand the dataset landscape, identify critical gaps, and guide future directions toward more effective, reliable, and scalable AI applications in construction. To address this gap, this study conducts an extensive search of academic databases and open-data platforms, yielding 51 publicly available visual datasets that span the 2005–2024 period. These datasets are categorized using a structured data schema covering (i) data fundamentals (e.g., size and license), (ii) data modalities (e.g., RGB and point cloud), (iii) annotation frameworks (e.g., bounding boxes), and (iv) downstream application domains (e.g., progress tracking). This study synthesizes these findings into an open-source catalog, OpenConstruction, supporting data-driven method development. Furthermore, the study discusses several critical limitations in the existing construction dataset landscape and presents a roadmap for future data infrastructure anchored in the Findability, Accessibility, Interoperability, and Reusability (FAIR) principles. By reviewing the current landscape and outlining strategic priorities, this study supports the advancement of data-centric solutions in the construction sector.
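To make the four schema dimensions concrete, the following is a minimal sketch of how one catalog entry might be represented and queried. It is an illustration only, not the OpenConstruction catalog's actual format: the class name `DatasetRecord`, its field names, the helper `select`, and the two example entries are all hypothetical placeholders rather than datasets or identifiers from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DatasetRecord:
    """Hypothetical catalog entry mirroring the four schema dimensions."""
    # (i) data fundamentals
    name: str
    year: int
    num_items: Optional[int] = None   # size; None when not reported
    license: Optional[str] = None     # None when no license is stated
    # (ii) data modalities, e.g. "RGB", "point cloud"
    modalities: List[str] = field(default_factory=list)
    # (iii) annotation frameworks, e.g. "bounding box"
    annotations: List[str] = field(default_factory=list)
    # (iv) downstream application domains, e.g. "progress tracking"
    tasks: List[str] = field(default_factory=list)


def select(records, modality=None, annotation=None, task=None):
    """Return the records that match every criterion that was given."""
    matches = []
    for record in records:
        if modality is not None and modality not in record.modalities:
            continue
        if annotation is not None and annotation not in record.annotations:
            continue
        if task is not None and task not in record.tasks:
            continue
        matches.append(record)
    return matches


# Placeholder entries for illustration; these are not real datasets.
catalog = [
    DatasetRecord(
        name="example-ppe-images",
        year=2020,
        num_items=1000,
        license="CC BY 4.0",
        modalities=["RGB"],
        annotations=["bounding box"],
        tasks=["safety monitoring"],
    ),
    DatasetRecord(
        name="example-site-scans",
        year=2022,
        modalities=["point cloud", "RGB"],
        annotations=["semantic mask"],
        tasks=["progress tracking"],
    ),
]

point_cloud_sets = select(catalog, modality="point cloud")
print([record.name for record in point_cloud_sets])
```

Keeping size and license optional reflects the abstract's observation that existing resources vary widely in how completely they are described; a missing license is recorded as absent rather than guessed.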

Paper Structure

This paper contains 34 sections, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Framework for the search, screening, and synthesis of open-access visual datasets in construction monitoring
  • Figure 2: Temporal and geographic distribution of the identified construction visual datasets. Datasets without specified collection locations are excluded.
  • Figure 3: Schema for characterizing open visual datasets in construction monitoring
  • Figure 4: Distribution of dataset sizes and their coverage across tasks
  • Figure 5: Overview of dataset characteristics in license types and data modalities (note: some datasets include multiple modalities).
  • ...and 4 more figures