Contrastive Learning Using Graph Embeddings for Domain Adaptation of Language Models in the Process Industry

Anastasia Zhukova, Jonas Lührs, Christian E. Lobmüller, Bela Gipp

TL;DR

This work addresses domain adaptation for German process-industry language models by leveraging graph-aware contrastive learning through graph embeddings (GE) derived from a heterogeneous knowledge graph. It adapts SciNCL to a domain with text logs and functional locations, using GE-based triplets to fine-tune LMs and improve semantic search on the PITEB benchmark. The key contribution lies in building a heterogeneous KG, initializing GE with text- and FL-descriptions, and sampling document triplets via GE neighborhoods, enabling effective, data-efficient fine-tuning that outperforms strong baselines (e.g., mE5-large) while using roughly one-third the parameters. The approach yields a 14.3% relative improvement over mE5-large and a 1.5% gain over M3 on PITEB with 2.45M training pairs, demonstrating practical impact for cost-efficient deployment in production settings and motivating further integration of domain graphs into LM fine-tuning.

Abstract

Recent trends in NLP utilize knowledge graphs (KGs) to enhance pretrained language models by incorporating additional knowledge from the graph structures to learn domain-specific terminology or relationships between documents that might otherwise be overlooked. This paper explores how SciNCL, a graph-aware neighborhood contrastive learning methodology originally designed for scientific publications, can be applied to the process industry domain, where text logs contain crucial information about daily operations and are often structured as sparse KGs. Our experiments demonstrate that language models fine-tuned with triplets derived from graph embeddings (GE) outperform a state-of-the-art mE5-large text encoder by 9.8-14.3% (5.45-7.96 percentage points) on the proprietary process industry text embedding benchmark (PITEB) while having 3 times fewer parameters.

Paper Structure

This paper contains 27 sections, 1 equation, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Graph embeddings are obtained from a directed heterogeneous domain graph for the process industry with two node types (1) text log (TL), i.e., logs of the daily operations at a production plant, and (2) functional locations (FL), i.e., hierarchically structured machinery on a production plant, and three edge types: related_to (green) connects two text logs, reports_about (black) links a text log to FLs, and part_of (blue) represents the hierarchical structure among FLs.
  • Figure 2: The methodology of adapting SciNCL (Ostendorff et al., 2022) to semantic search in the process industry domain. The two main changes are generating document triplets using graph embeddings (GE) constructed from a heterogeneous knowledge graph (KG) and using these triplets as the source for query-document triplet generation during bi-encoder fine-tuning.
  • Figure 3: The best-performing fine-tuned mGBERT outperformed the strongest baseline, M3 (Chen et al., 2024), in almost all plants whose data was used for fine-tuning (i.e., A, C, D, G).
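
The graph schema in the Figure 1 caption and the neighborhood-based triplet sampling in the Figure 2 caption can be illustrated with a small Python sketch. It is not the authors' implementation: the node ids, the toy 2-d vectors standing in for trained graph embeddings, the helper names, and the rank cut-offs (`k_pos`, `neg_start`, `neg_end`) are all assumptions made for illustration. The sampling follows the general SciNCL idea of drawing positives from an anchor's nearest neighbors in embedding space and negatives from a band further down the ranking.

```python
import math

# Edge schema from the Figure 1 caption: edge type -> (source type, target type).
EDGE_SCHEMA = {
    "related_to": ("TL", "TL"),     # text log -> text log
    "reports_about": ("TL", "FL"),  # text log -> functional location
    "part_of": ("FL", "FL"),        # functional location hierarchy
}

# Hypothetical toy graph (ids are made up for this example).
NODES = {"tl1": "TL", "tl2": "TL", "tl3": "TL", "tl4": "TL", "pump": "FL", "unit": "FL"}
EDGES = [
    ("tl1", "related_to", "tl2"),
    ("tl1", "reports_about", "pump"),
    ("tl3", "reports_about", "unit"),
    ("pump", "part_of", "unit"),
]


def edge_is_valid(src, edge_type, dst):
    """Check that a directed edge connects the node types its type allows."""
    return EDGE_SCHEMA.get(edge_type) == (NODES[src], NODES[dst])


# Made-up vectors standing in for trained graph embeddings of the text logs.
GE = {
    "tl1": (1.0, 0.0),
    "tl2": (0.9, 0.1),
    "tl3": (0.6, 0.4),
    "tl4": (0.0, 1.0),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranked_neighbors(anchor, embeddings):
    """Return all other nodes, most similar to the anchor first."""
    others = [n for n in embeddings if n != anchor]
    return sorted(others, key=lambda n: cosine(embeddings[anchor], embeddings[n]), reverse=True)


def sample_triplets(embeddings, k_pos=1, neg_start=2, neg_end=3):
    """Build (anchor, positive, negative) document triplets.

    Positives are the k_pos nearest neighbors; negatives come from the
    rank band [neg_start, neg_end) of the same neighbor ranking.
    """
    triplets = []
    for anchor in embeddings:
        ranked = ranked_neighbors(anchor, embeddings)
        for pos in ranked[:k_pos]:
            for neg in ranked[neg_start:neg_end]:
                triplets.append((anchor, pos, neg))
    return triplets


if __name__ == "__main__":
    print(all(edge_is_valid(*edge) for edge in EDGES))
    for triplet in sample_triplets(GE):
        print(triplet)
```

In a fine-tuning setup, each id in a triplet would be replaced by the text of the corresponding log before being passed to a triplet loss; the cut-offs would be tuned on real embeddings rather than fixed as here.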