Increasing the Accessibility of Causal Domain Knowledge via Causal Information Extraction Methods: A Case Study in the Semiconductor Manufacturing Industry

Houssam Razouk, Leonie Benischke, Daniel Garber, Roman Kern

TL;DR

The paper tackles the challenge of making causal domain knowledge in the semiconductor industry more accessible by automatically extracting causal information from unstructured and semi-structured documents. It develops two extraction paradigms, SST and MST, and investigates domain-adaptive pretraining with UM and PMI masking on BERT-based models, evaluating on FMEA and presentation slides. MST outperforms SST, achieving 93% F1 on FMEA and 73% F1 on slides, with domain-aligned models and in-domain fine-tuning providing additional gains, especially for enchained and disrupted relations. The work offers annotated data, practical annotation guidelines, and methodological guidance for practitioners to convert textual causal knowledge into structured representations, enabling improved downstream analysis in industrial settings.

Abstract

The extraction of causal information from textual data is crucial in the industry for identifying and mitigating potential failures, enhancing process efficiency, prompting quality improvements, and addressing various operational challenges. This paper presents a study on the development of automated methods for causal information extraction from actual industrial documents in the semiconductor manufacturing industry. The study proposes two types of causal information extraction methods, single-stage sequence tagging (SST) and multi-stage sequence tagging (MST), and evaluates their performance using existing documents from a semiconductor manufacturing company, including presentation slides and FMEA (Failure Mode and Effects Analysis) documents. The study also investigates the effect of representation learning on downstream tasks. The case study shows that the proposed MST methods for extracting causal information from industrial documents are suitable for practical applications, especially for semi-structured documents such as FMEAs, where they reach a 93% F1 score. Additionally, MST achieves a 73% F1 score on texts extracted from presentation slides. Finally, the study highlights the importance of choosing a language model that is well aligned with the domain, and of in-domain fine-tuning.

Paper Structure

This paper contains 12 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Proposed method for causal information extraction from various industrial documents. For the extraction of causal information, texts are gathered from tabular FMEA documents and industrial presentation slides. The causal entities and relations in these texts are then annotated following specific annotation guidelines, as described in the paper's annotation section. The depicted example illustrates a text extracted from an FMEA cell that contains two causal relations. For higher generalizability, a meaningful text representation is generated using various language models. This representation is then used to train different sequence tagging models for causal information extraction. The figure depicts the multi-stage sequence tagging approach, which uses a cascade of models for different tasks.
  • Figure 2: Automated Causal Information Extraction from Text. The proposed approach for causal information extraction from text utilizes a BERT-based language model that is selected based on the relevance of its initial training data set to the domain of interest. The model is initially fine-tuned for the domain using a masked language modeling objective, with two masking strategies (UM and PMI) being compared. A portion of the annotated data set is used to train the model, and two causal information extraction methods are compared: single-stage sequence tagging (SST) and multi-stage sequence tagging (MST). SST, based on multi-label token classification, is capable of detecting overlapped entities. MST cascades multiple models, including a binary classifier for trigger detection, a binary classifier for trigger grouping, an attention network for trigger combined embedding, and a multi-label classifier for argument detection. In addition to detecting overlapped entities, MST is capable of extracting enchained relations and detecting disrupted entities.
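
The Figure 2 caption notes that SST, being multi-label token classification, can detect overlapped entities. A minimal sketch of why that works is given here, assuming each label is scored with an independent sigmoid so that one token can carry several labels at once. The label set, the example sentence, and the logits are all invented for illustration; they are not taken from the paper, and the paper's actual tagging scheme may differ.

```python
import math

# Hypothetical label set; the paper's real tag inventory is not given here.
LABELS = ["CAUSE", "EFFECT", "TRIGGER"]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_multilabel(logits, threshold=0.5):
    """Score every label independently, so a token may receive several labels."""
    return [
        {label for label, z in zip(LABELS, token_logits) if sigmoid(z) >= threshold}
        for token_logits in logits
    ]


def spans_for(label, token_labels):
    """Collect contiguous [start, end) token ranges that carry the given label."""
    spans, start = [], None
    for i, labels in enumerate(token_labels):
        if label in labels and start is None:
            start = i
        elif label not in labels and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(token_labels)))
    return spans


tokens = ["contamination", "causes", "short", "circuit",
          "leading", "to", "yield", "loss"]

# Made-up logits in the order CAUSE, EFFECT, TRIGGER (not model output).
logits = [
    [3.0, -3.0, -3.0],   # contamination: cause
    [-3.0, -3.0, 3.0],   # causes: trigger
    [2.5, 2.5, -3.0],    # short: effect of one relation, cause of the next
    [2.5, 2.5, -3.0],    # circuit: same overlap
    [-3.0, -3.0, 3.0],   # leading: trigger
    [-3.0, -3.0, 3.0],   # to: trigger
    [-3.0, 3.0, -3.0],   # yield: effect
    [-3.0, 3.0, -3.0],   # loss: effect
]

token_labels = decode_multilabel(logits)
cause_spans = spans_for("CAUSE", token_labels)
effect_spans = spans_for("EFFECT", token_labels)

for label, spans in (("CAUSE", cause_spans), ("EFFECT", effect_spans)):
    print(label, [" ".join(tokens[s:e]) for s, e in spans])
```

In this toy input, "short circuit" is labeled both as a cause and as an effect, which a single-label (softmax) tagger could not express. The sketch only recovers spans, not which cause belongs to which effect; according to the caption, linking enchained relations and handling disrupted entities is what the MST cascade adds.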