
Olfactory Label Prediction on Aroma-Chemical Pairs

Laura Sisson, Aryan Amit Barsainyan, Mrityunjay Sharma, Ritesh Kumar

TL;DR

This work addresses the challenge of predicting olfactory descriptors for blends of aroma-chemicals using graph neural networks. It introduces a labeled blended-pair dataset and compares two architectures, a Graph Isomorphism Network (GIN-GNN) and a Message Passing Neural Network (MPNN-GNN), including a graph-carving strategy to ensure robust train/test separation. The MPNN-GNN achieves a mean AUROC of about $0.77$ on blended pairs and $0.89$ on single-molecule prediction, with the GIN-GNN close behind on blends, indicating strong transferability between blend and single-molecule tasks. Embedding-space analyses reveal non-linear blending and selective contributions from constituent molecules, and the authors provide public code to encourage further exploration and data augmentation in this domain.
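The idea behind graph carving can be illustrated with a minimal sketch. It assumes a meta-graph whose nodes are molecules and whose edges are labeled pairs, as in the Figure 1 caption. This is not the authors' algorithm: their carving is described as maximizing the number of usable pairs without shifting the label distribution, whereas the sketch below assigns molecules at random. The function name `carve_pairs` and the `test_fraction` parameter are hypothetical.

```python
import random


def carve_pairs(pairs, test_fraction=0.25, seed=0):
    """Split molecule pairs so that no molecule appears in both train and test.

    Each molecule is assigned to one side (random assignment is an
    assumption of this sketch). A pair is kept only if both of its
    molecules landed on the same side; crossing pairs are discarded.
    """
    molecules = sorted({m for pair in pairs for m in pair})
    rng = random.Random(seed)
    rng.shuffle(molecules)
    n_test = int(len(molecules) * test_fraction)
    test_molecules = set(molecules[:n_test])

    train, test, dropped = [], [], []
    for a, b in pairs:
        in_test = (a in test_molecules, b in test_molecules)
        if all(in_test):
            test.append((a, b))
        elif not any(in_test):
            train.append((a, b))
        else:
            dropped.append((a, b))  # edge crosses the train/test cut
    return train, test, dropped


# Toy meta-graph: letters stand in for molecules.
toy_pairs = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"),
             ("E", "F"), ("F", "G"), ("G", "H"), ("A", "H")]
train_pairs, test_pairs, dropped_pairs = carve_pairs(toy_pairs, test_fraction=0.5)
```

The denser the meta-graph, the more pairs cross the cut and are lost, which is why a carving strategy that chooses the cut carefully matters.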

Abstract

The application of deep learning techniques on aroma-chemicals has resulted in models more accurate than human experts at predicting olfactory qualities. However, public research in this domain has been limited to predicting the qualities of single molecules, whereas in industry applications, perfumers and food scientists are often concerned with blends of many molecules. In this paper, we apply both existing and novel approaches to a dataset we gathered consisting of labeled pairs of molecules. We present graph neural network models capable of accurately predicting the odor qualities arising from blends of aroma-chemicals, with an analysis of how variations in architecture can lead to significant differences in predictive power.

Paper Structure (27 sections, 1 equation, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Methodology. (a, b) Non-linear relationship between the qualities of constituent aroma-chemicals and the overall blend. Although the same molecules appear across the single and mixture datasets, emergent notes appear when molecules are combined, and other notes become muted in the blend. (c) Sample of the densest region of the blended pair meta-graph. Here, 0.5% of the total meta-graph nodes, consisting of 7 train molecules (in blue) and 7 test molecules (in red), are visualized, with an average degree of 6. Because there are many data points/edges per molecule, the meta-graph is dense and thus difficult to carve. (d) Graph carving schematic. The carving algorithm aimed to maximize the number of usable pairs without causing distributional shift in labels. (e, f) Schema of the experiment: (e) the entire optimization and training pipeline used in this paper, including (f) the 5-fold 50:25:25 train/test/validation splits used for hyper-parameter optimization. (g) GNN predictions on single molecules. Message passing layers are applied across the molecular graph, followed by a readout phase and a multi-layer perceptron (MLP) that predicts the final label. (h) MPNN-GNN predictions on blended pairs. The molecular graphs are treated as one graph, with a combined readout and MLP as above. (i) GIN-GNN predictions on blended pairs. The molecule graphs have separate message passing and readout steps, and are combined only before the MLP.
  • Figure 2: Predictive power of our GNN models and the Morgan Fingerprint baseline across all labels with random baseline (dashed line). (a) Blended pair task AUROC scores, by descriptor. (b) Single molecule task AUROC scores, by descriptor.
  • Figure 3: Analysis of odor labels. (a) KDE plots for the top 5 descriptors by predictive accuracy in the training set of the single molecule task. (b) As (a), for the test set. (c) KDE plots for the bottom 5 descriptors by predictive accuracy in the training set of the single molecule task. (d) As (c), for the test set. (e) KDE plots for the top 5 descriptors by predictive accuracy in the training set of the blended pair task. (f) As (e), for the test set. (g) KDE plots for the bottom 5 descriptors by predictive accuracy in the training set of the blended pair task. (h) As (g), for the test set.
  • Figure 4: Scatter-plots of fit coefficients for predicting the blended pair's embedding from the single-molecule GNN embeddings. (a) Scatter-plot of the fit coefficients using the MPNN-GNN model, with zoom on the centroid. Across all pairs, the average $r^2$ is $0.47$ and the $p$-value for the $F$-statistic is $4.68 \times 10^{-5}$. Notably, the distribution is not centered on the origin. In some cases, the blended pair's embedding consists of equal combinations of each individual embedding, while in other cases, one particular embedding predominates. (b) Scatter-plot, as above, using the GIN-GNN embeddings. The correlation between single molecule and blended pair embeddings was weaker, with an average $r^2$ of $0.021$ and a $p$-value for the $F$-statistic of $0.445$. The distribution is centered on the origin, suggesting that for many points, neither molecule's embedding factors into the pair embedding. The vertical and horizontal lines mark cases where one component predominates and the other molecule is not factored in at all.
  • Figure S1: Combined scatter-plot of the regressors for both the MPNN-GNN and the GIN-GNN.
  • ...and 1 more figures
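
The fit coefficients described in the Figure 4 caption can be illustrated with a small sketch. It assumes that each pair embedding is regressed on the two constituent embeddings by ordinary least squares without an intercept; the paper's exact regression setup is not given here, so this is only one plausible reading. The function name `fit_pair_embedding` and the synthetic embeddings are hypothetical.

```python
import numpy as np


def fit_pair_embedding(e1, e2, pair):
    """Least-squares fit of pair ~= a * e1 + b * e2.

    e1, e2, and pair are 1-D embedding vectors of equal length.
    Returns (a, b, r2). No intercept is fitted (an assumption of this
    sketch), so r2 is computed against the mean of `pair` and can be
    negative for poor fits.
    """
    X = np.stack([e1, e2], axis=1)                  # shape (d, 2)
    coef, *_ = np.linalg.lstsq(X, pair, rcond=None)
    residual = pair - X @ coef
    ss_res = float(residual @ residual)
    ss_tot = float(((pair - pair.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    return float(coef[0]), float(coef[1]), r2


# Synthetic example: a blend dominated by the first molecule.
rng = np.random.default_rng(0)
emb_1, emb_2 = rng.normal(size=(2, 64))
emb_pair = 0.7 * emb_1 + 0.3 * emb_2
a, b, r2 = fit_pair_embedding(emb_1, emb_2, emb_pair)
```

Plotting `(a, b)` for every pair would give a scatter of the kind the caption describes: points near the diagonal correspond to equal contributions, points near an axis to blends where one molecule predominates, and points near the origin to pairs whose embedding is explained by neither constituent.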