Implementing NLPs in industrial process modeling: Addressing Categorical Variables

Eleni D. Koronaki, Geremy Loachamin Suntaxi, Paris Papavasileiou, Dimitrios G. Giovanis, Martin Kathrein, Andreas G. Boudouvis, Stéphane P. A. Bordas

TL;DR

This work tackles the encoding of categorical variables in industrial process modeling by using NLP-derived embeddings that preserve semantic distances between categories. It compares Doc2Vec and pretrained transformer embeddings (all-MiniLM-L12-v2, all-mpnet-base-v2), combined with PCA or UMAP dimensionality reduction and an XGBoost regressor. The study uses real production data from a chemical vapor deposition coating process and applies SHAP and total_gain feature-importance analyses to interpret the predictors. The results indicate that embeddings can improve interpretability and enable sensitivity analysis and uncertainty quantification, with implications for data-driven process optimization beyond the case study.

Abstract

Important process variables are often categorical, i.e., names or labels representing, for example, categories of inputs, types of reactors, or a sequence of steps. In this work, we use Natural Language Processing (NLP) models to derive embeddings of such inputs that represent their actual meaning or reflect the "distances" between categories, i.e., how similar or dissimilar they are. This is a marked departure from the current standard practice of using binary, or one-hot, encoding to replace categorical variables with sequences of ones and zeros. Combined with dimensionality reduction techniques, either linear, such as Principal Component Analysis (PCA), or nonlinear, such as Uniform Manifold Approximation and Projection (UMAP), the proposed approach leads to a meaningful, low-dimensional feature space. The significance of obtaining meaningful embeddings is illustrated in the context of an industrial coating process for cutting tools that includes both numerical and categorical inputs. In this process, subject matter expertise suggests that the categorical inputs are critical for determining the final outcome, but the current state of the art cannot take this into account. The proposed approach enables feature importance analysis, which is a marked improvement over the current state of the art in the encoding of categorical variables. The approach is not limited to the case study presented here and is suitable for applications with a similar mix of critical categorical and numerical inputs.
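As an illustration of why one-hot encoding carries no notion of similarity between categories, the following minimal Python sketch (not taken from the paper; the shape labels and helper names are hypothetical) builds one-hot vectors and computes all pairwise Euclidean distances. Every pair of distinct categories ends up at the same distance, $\sqrt{2}$, so the encoding cannot express that two categories are closer to each other than to a third.

```python
import math

def one_hot(categories):
    """Map each distinct category label to a one-hot vector (list of 0/1 ints)."""
    ordered = sorted(set(categories))
    return {
        label: [1 if i == j else 0 for j in range(len(ordered))]
        for i, label in enumerate(ordered)
    }

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical insert-shape labels, for illustration only.
labels = ["rhombus_80", "rhombus_55", "square", "triangle"]
codes = one_hot(labels)

# Distance for every unordered pair of distinct labels.
distances = {
    (a, b): euclidean(codes[a], codes[b])
    for a in labels
    for b in labels
    if a < b
}

for pair, d in sorted(distances.items()):
    print(pair, round(d, 4))
```

In this toy setting the two rhombus labels are exactly as far apart as a rhombus and a triangle, which is the limitation that dense, similarity-preserving embeddings are meant to address.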

Paper Structure (25 sections, 1 equation, 16 figures, 6 tables)

Figures (16)

  • Figure 1: (a) Indicative geometries of the coated cutting tools. (b) A 3D representation of a 3-disk part of the reactor. The inlet perforations on the rotating inlet tube are shown in red. The outlet perforations for each disk are shown in blue.
  • Figure 2: Positions with available $\alpha$-Al$_2$O$_3$ thickness values from the production data for our test case. In general, across different production runs, the R position (the one closest to the reactor outlet) is the one with the highest amount of data. For this reason, the ML models are trained to make predictions for inserts placed in this position.
  • Figure 3: Example of an ISO designation for indexable inserts.
  • Figure 4: Cosine similarity is defined as the cosine of the angle between the embedding vectors of two objects. Using this concept, we consider two insert shapes (reference and candidate) that are embedded into a feature space, where their pairwise cosine similarity is computed. The similarity matrix on the right contains all the computed values and illustrates the relationships between different insert shape embeddings.
  • Figure 5: Heatmap showing the cosine similarity values calculated from the dense vectors obtained using the Doc2Vec model. These values range from -1 to 1, where values closer to 0 are represented in lighter shades, and values closer to -1 or 1 are colored using darker shades. The value enclosed in a yellow box represents the high similarity between rhombus-shaped objects; in contrast, the low values enclosed in the green and blue boxes suggest lower similarity between the respective shapes.
  • ...and 11 more figures
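
The cosine similarity described in the caption of Figure 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the three-dimensional vectors and the names `shape_A`, `shape_B`, `shape_C` are made up, whereas real Doc2Vec or transformer embeddings have many more dimensions. For vectors $u$ and $v$, the similarity is $u \cdot v / (\|u\| \, \|v\|)$, which lies in $[-1, 1]$.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def similarity_matrix(embeddings):
    """Return (names, matrix) with matrix[i][j] = cos(names[i], names[j])."""
    names = list(embeddings)
    matrix = [
        [cosine_similarity(embeddings[a], embeddings[b]) for b in names]
        for a in names
    ]
    return names, matrix

# Made-up toy embeddings, for illustration only.
toy_embeddings = {
    "shape_A": [1.0, 0.0, 0.0],
    "shape_B": [1.0, 1.0, 0.0],
    "shape_C": [-1.0, 0.0, 0.0],
}

names, matrix = similarity_matrix(toy_embeddings)
for name, row in zip(names, matrix):
    print(name, [round(value, 3) for value in row])
```

The resulting matrix is symmetric with ones on the diagonal, which is the structure a heatmap such as the one in Figure 5 visualizes.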