Deep Learning and Natural Language Processing in the Field of Construction

Rémy Kessler, Nicolas Béchet

TL;DR

This work tackles building a construction-domain knowledge model by automatically extracting terminology from technical specifications and detecting hypernym relations among terms. It combines a lexical-statistical terminology extraction pipeline with both embedding-based and end-to-end hypernym detection methods, evaluated across French and English datasets including the domain-specific VOCAGEN corpus. Key contributions include a multi-stage terminology extraction system with pattern-based filtering and web pruning, a comprehensive comparison of embedding-based and LLM-based hypernym detection, and strong performance, notably with CamemBERT-based models on VOCAGEN and LLMs on standard benchmarks. The findings demonstrate the practicality of domain transfer for ontology construction and point to future directions such as ensemble fusion and knowledge-model-informed predictions to further enhance taxonomy development and deployment in construction contexts.

Abstract

This article presents a complete process for extracting hypernym relationships in the field of construction in two main steps: terminology extraction and detection of hypernyms among the extracted terms. We first describe the corpus analysis method used to extract terminology from a collection of technical specifications in the field of construction. Using statistics and word n-gram analysis, we extract the domain's terminology and then apply pruning steps based on linguistic patterns and internet queries to improve the quality of the final terminology. Second, we present a machine-learning approach based on various word embedding models and their combinations to detect hypernyms within the extracted terminology. The extracted terminology is evaluated manually by six domain experts, and the hypernym identification method is evaluated on several datasets. The overall approach provides relevant and promising results.
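To make the first step more concrete, the following is a minimal sketch of frequency-based n-gram candidate extraction. It is not the authors' pipeline: the stop-word list, the `candidate_terms` function, the `max_n` and `min_count` parameters, and the sample sentences are all assumptions made for illustration, and the linguistic-pattern and internet-query pruning steps mentioned in the abstract are not reproduced here.

```python
from collections import Counter
import re

# Hypothetical stop-word list (assumption); a real system would use a
# fuller, language-specific list.
STOPWORDS = {"the", "of", "a", "an", "in", "and", "to", "with",
             "for", "is", "are", "be", "shall"}

def tokenize(text):
    """Lowercase the text and keep runs of letters (accented letters included)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def candidate_terms(documents, max_n=3, min_count=2):
    """Count word n-grams of length 1..max_n and keep those that occur at
    least min_count times and neither start nor end with a stop word."""
    counts = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                counts[" ".join(gram)] += 1
    return {term: c for term, c in counts.items() if c >= min_count}

# Invented sample sentences standing in for technical specifications.
docs = [
    "The concrete slab shall be reinforced with steel mesh.",
    "Pour the concrete slab over the steel mesh.",
]
terms = candidate_terms(docs)
```

On these two sentences, multi-word candidates such as "concrete slab" and "steel mesh" survive the frequency threshold, while one-off sequences are dropped. A frequency threshold alone keeps many generic words, which is one reason further pruning of the kind the abstract describes would be needed.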

Paper Structure

This paper contains 34 sections, 4 equations, and 10 figures.

Figures (10)

  • Figure 1: Context of the project
  • Figure 2: Final process chain
  • Figure 3: Statistics of the collection
  • Figure 4: System overview
  • Figure 5: Distribution of linguistic patterns according to the knowledge model
  • ...and 5 more figures