Deep Learning and Natural Language Processing in the Field of Construction
Rémy Kessler, Nicolas Béchet
TL;DR
This work builds a construction-domain knowledge model by automatically extracting terminology from technical specifications and detecting hypernym relations among the extracted terms. It combines a lexical-statistical terminology extraction pipeline with both embedding-based and end-to-end hypernym detection methods, evaluated on French and English datasets, including the domain-specific VOCAGEN corpus. Key contributions are a multi-stage terminology extraction system with pattern-based filtering and web pruning, and a comprehensive comparison of embedding-based and LLM-based hypernym detection. The approach performs strongly, notably with CamemBERT-based models on VOCAGEN and with LLMs on standard benchmarks. The findings demonstrate the practicality of domain transfer for ontology construction and point to future directions, such as ensemble fusion and knowledge-model-informed predictions, to further support taxonomy development and deployment in construction contexts.
Abstract
This article presents a complete process for extracting hypernym relationships in the field of construction in two main steps: terminology extraction and detection of hypernyms among the extracted terms. We first describe the corpus analysis method used to extract terminology from a collection of technical specifications in the field of construction. Using statistics and word n-gram analysis, we extract the domain's terminology and then apply pruning steps based on linguistic patterns and internet queries to improve the quality of the final terminology. Second, we present a machine-learning approach based on various word embedding models and their combinations to detect hypernyms within the extracted terminology. The extracted terminology is evaluated manually by six domain experts, and the hypernym identification method is evaluated on several datasets. Overall, the approach provides relevant and promising results.
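As a rough illustration of the first step, the sketch below counts word n-grams over a toy corpus and prunes candidates with a simple boundary rule. It is a minimal sketch under stated assumptions, not the authors' pipeline: the stopword list, the `candidate_terms` function, the `n_max` and `min_count` parameters, and the two example sentences are all hypothetical, and the stopword-boundary rule only stands in for the linguistic patterns mentioned above. The internet-query pruning step is not covered.

```python
import re
from collections import Counter

# Hypothetical stopword list; a stand-in for real linguistic-pattern pruning.
STOPWORDS = {"the", "of", "a", "an", "in", "and", "to", "with", "is", "be", "must"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def candidate_terms(documents, n_max=3, min_count=2):
    """Count word n-grams (n = 1..n_max) and keep the frequent ones.

    A candidate is discarded if it starts or ends with a stopword
    (assumed pruning rule) or occurs fewer than min_count times.
    """
    counts = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        for n in range(1, n_max + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                counts[" ".join(gram)] += 1
    return {term: c for term, c in counts.items() if c >= min_count}


# Toy corpus, invented for illustration only.
docs = [
    "The reinforced concrete slab must be cured.",
    "Pour the reinforced concrete slab in two layers.",
]
terms = candidate_terms(docs)
for term in sorted(terms, key=lambda t: (-len(t.split()), t)):
    print(term, terms[term])
```

On a real collection of specifications, the frequency threshold and the pruning rule would need tuning, and part-of-speech patterns would likely replace the stopword heuristic used here.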
