Link Prediction for Event Logs in the Process Industry

Anastasia Zhukova, Thomas Walton, Christian E. Lobmüller, Bela Gipp

TL;DR

This work demonstrates that common NLP tasks can be combined and adapted to a domain-specific setting of the German process industry, improving data quality and connectivity in shift logs.

Abstract

In the era of graph-based retrieval-augmented generation (RAG), link prediction is a significant preprocessing step for improving the quality of fragmented or incomplete domain-specific data for graph retrieval. Knowledge management in the process industry uses RAG-based applications to optimize operations, ensure safety, and facilitate continuous improvement by effectively leveraging operational data and past insights. A key challenge in this domain is the fragmented nature of event logs in shift books, where related records are often kept separate even though they belong to a single event or process. This fragmentation hinders the recommendation of previously implemented solutions to users, which is crucial for timely problem-solving at live production sites. To address this problem, we develop a record linking (RL) model, which we define as a cross-document coreference resolution (CDCR) task. RL adapts the task definition of CDCR and combines two state-of-the-art CDCR models with the principles of natural language inference (NLI) and semantic text similarity (STS) to perform link prediction. The evaluation shows that our RL model outperformed the best versions of our baselines, i.e., NLI and STS, by 28% (11.43 points) and 27.4% (11.21 points), respectively. Our work demonstrates that common NLP tasks can be combined and adapted to a domain-specific setting of the German process industry, improving data quality and connectivity in shift logs.

Paper Structure

This paper contains 16 sections, 3 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: The efficiency and accuracy of knowledge management (KM) applications, such as RAG used as a domain-specific solution recommender system, strongly rely on record connectivity in a knowledge graph (KG). Record linking (RL) serves as a preprocessing step for link prediction in text logs that report on tasks, problems, and solutions in the production plant, linking records that are part of the same story but were reported as updates to the event.
  • Figure 2: Mapping of the CDCR definitions to the record linking (RL) task.
  • Figure 3: The proposed CDCR-driven record linking (RL) model. Compared to most of the state-of-the-art CDCR models (Cattan et al., 2021; Eirew et al., 2021; Bugert and Gurevych, 2021), our joint encoding of the records is enhanced by a joint encoding stemming from the vectors of the [CLS] token (Caciularu et al., 2021) and a feature vector based on the similarity of the records' attributes (Barhom et al., 2019).
  • Figure 4: Comparison of the computational effort of computing the similarity matrices when evaluating on the topic level vs. the subtopic level. The subtopic level saves computational effort by avoiding the computation of unnecessary scores between temporally distant mentions. Some original chains may be split by the subtopic time frame; therefore, a sliding subtopic window is required to evaluate all parts of the original chains.
  • Figure 5: RL performance on a topic level. The proposed RL with daGBERT+FL outperformed all other modifications across almost all topics when using tDFS.
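
The trade-off described in the Figure 4 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of plain numeric timestamps, and the `window` and `step` parameters are assumptions made only for illustration. The sketch counts which record pairs would need a similarity score on the topic level (all pairs) and on the subtopic level (only pairs that share at least one time window), and shows why overlapping (sliding) windows are needed so that temporally close records on either side of a window boundary are still compared.

```python
from itertools import combinations


def topic_level_pairs(timestamps):
    """Return all unordered index pairs: every record is compared with every other."""
    return set(combinations(range(len(timestamps)), 2))


def sliding_window_pairs(timestamps, window, step):
    """Return index pairs that fall together into at least one time window.

    Hypothetical helper: windows are half-open intervals [start, start + window)
    that begin at the earliest timestamp and advance by `step`.
    """
    pairs = set()
    start, last = min(timestamps), max(timestamps)
    while start <= last:
        in_window = [i for i, t in enumerate(timestamps)
                     if start <= t < start + window]
        pairs.update(combinations(in_window, 2))
        start += step
    return pairs


# Toy timestamps (e.g., hours since the first record of a topic).
timestamps = [0, 1, 2, 10, 11, 30]

all_pairs = topic_level_pairs(timestamps)
windowed_pairs = sliding_window_pairs(timestamps, window=8, step=4)
print(len(all_pairs), len(windowed_pairs))  # 15 pairs vs. 4 pairs

# Disjoint windows (step == window) can split a chain at a window boundary;
# overlapping windows recover the pair (1, 2), whose records are 2 units apart.
close_records = [0, 7, 9]
disjoint = sliding_window_pairs(close_records, window=8, step=8)
overlapping = sliding_window_pairs(close_records, window=8, step=4)
```

In this toy setting, the windowed variant scores 4 pairs instead of 15, and the second example shows that disjoint windows miss the pair of records at times 7 and 9, while windows that overlap by half their length keep it.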