CODE-ACCORD: A Corpus of building regulatory data for rule generation towards automatic compliance checking

Hansi Hettiarachchi, Amna Dridi, Mohamed Medhat Gaber, Pouyan Parsafard, Nicoleta Bocaneala, Katja Breitenfelder, Gonçal Costa, Maria Hedblom, Mihaela Juganaru-Mathieu, Thamer Mecharnia, Sumee Park, He Tan, Abdel-Rahman H. Tawil, Edlira Vakaj

TL;DR

CODE-ACCORD tackles the challenge of converting unstructured building regulation text into machine-readable rules for automatic compliance checking (ACC) in the architecture, engineering, and construction (AEC) domain. It builds a corpus of 862 self-contained sentences drawn from the building regulations of England and Finland, combining semi-automatic sentence collection with manual annotation to produce 4,297 entities across four categories and 4,329 relations across ten categories, with train/test splits. The work introduces an annotation strategy intended to generalize to regulatory data in other domains and provides open access to the dataset and annotations, enabling supervised learning and transformer-based methods for entity recognition and relation extraction. This resource lowers the barrier to automated ACC development and supports integrating NLP/ML tools into building regulation compliance workflows.

Abstract

Automatic Compliance Checking (ACC) within the Architecture, Engineering, and Construction (AEC) sector necessitates automating the interpretation of building regulations to achieve its full potential. Converting textual rules into machine-readable formats is challenging due to the complexities of natural language and the scarcity of resources for advanced Machine Learning (ML). Addressing these challenges, we introduce CODE-ACCORD, a dataset of 862 sentences from the building regulations of England and Finland. Only the self-contained sentences, which express complete rules without needing additional context, were considered as they are essential for ACC. Each sentence was manually annotated with entities and relations by a team of 12 annotators to facilitate machine-readable rule generation, followed by careful curation to ensure accuracy. The final dataset comprises 4,297 entities and 4,329 relations across various categories, serving as a robust ground truth. CODE-ACCORD supports a range of ML and Natural Language Processing (NLP) tasks, including text classification, entity recognition, and relation extraction. It enables applying recent trends, such as deep neural networks and large language models, to ACC.

Paper Structure

This paper contains 10 sections, 1 equation, 10 figures, 8 tables.

Figures (10)

  • Figure 1: The Semi-Automatic CODE-ACCORD Data Preparation Methodology
  • Figure 2: Sample of entity labels in BIO format
  • Figure 3: Distribution of entity categories
  • Figure 4: Distribution of the number of entities per sentence
  • Figure 5: Sequence length distribution of annotated text spans as entities
  • ...and 5 more figures
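
Figure 2 refers to entity labels in BIO format (B = beginning of an entity, I = inside, O = outside). As a minimal sketch of what that encoding looks like, the Python snippet below converts token-level entity spans into BIO tags. The example sentence, the span offsets, the label names (`object`, `property`, `value`), and the helper `spans_to_bio` are all hypothetical illustrations and are not taken from the dataset or its tooling.

```python
from typing import List, Tuple


def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert entity spans (start, end_exclusive, label) into BIO tags.

    Hypothetical helper for illustration; not part of CODE-ACCORD.
    """
    tags = ["O"] * len(tokens)  # every token starts outside any entity
    for start, end, label in spans:
        if not (0 <= start < end <= len(tokens)):
            raise ValueError(f"span ({start}, {end}) is out of range")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"span ({start}, {end}) overlaps another span")
        tags[start] = f"B-{label}"  # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining tokens of the entity
    return tags


# Hypothetical regulatory sentence and annotations (assumed, not from the corpus).
tokens = ["The", "door", "width", "shall", "be", "at", "least", "800", "mm", "."]
spans = [(1, 2, "object"), (2, 3, "property"), (7, 9, "value")]

bio_tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
```

One tag per token keeps the labels aligned with the input sequence, which is the shape that token-classification models for entity recognition typically expect.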