Machine learning for smell: Ordinal odor strength prediction of molecular perfumery components

Peter Fichtelmann, Julia Westermayr

TL;DR

This work tackles predicting odor strength, an important descriptor for fragrance design, by constructing an ordinal dataset of over 2,000 molecules from Good Scents and PubChem. It evaluates diverse molecular representations, including RDKit descriptors, Morgan/topological fingerprints, and pretrained encoders, under two learning schemes: a direct four-class ordinal predictor and a two-step odorous/non-odorous then strength pipeline. The best-performing model is a multilayer perceptron trained on 217 RDKit descriptors, achieving a macro MSE of approximately $0.53$ and $R^2$ of roughly $0.57$ on hold-out data, and generalizes to independent test molecules. SHAP analysis links polarity, molecular weight/size, ring features, and branching to odor-strength predictions, aligning with mass-transport constraints and offering mechanistic insight to enable in silico fragrance design.
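
To make the macro MSE quoted above concrete, here is a minimal sketch of one common reading of the metric: the squared error is averaged within each true odor-strength class first, and the class means are then averaged, so that rare classes weigh as much as frequent ones. The integer encoding `LEVELS`, the function name `macro_mse`, and the toy labels are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict

# Assumed integer encoding of the four ordinal categories (not from the paper).
LEVELS = {"odorless": 0, "low": 1, "medium": 2, "high": 3}


def macro_mse(y_true, y_pred):
    """Mean of the per-class mean squared errors, grouped by true class."""
    sq_errors = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        sq_errors[t].append((t - p) ** 2)
    per_class = [sum(errs) / len(errs) for errs in sq_errors.values()]
    return sum(per_class) / len(per_class)


def plain_mse(y_true, y_pred):
    """Ordinary MSE over all samples, shown for comparison."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy, made-up labels: one rare "high" molecule is badly mispredicted.
y_true = [LEVELS[s] for s in
          ["odorless", "low", "low", "medium", "medium", "medium", "high"]]
y_pred = [0, 1, 2, 2, 2, 3, 1]

macro = macro_mse(y_true, y_pred)
plain = plain_mse(y_true, y_pred)
print(f"macro MSE = {macro:.3f}, plain MSE = {plain:.3f}")
```

In this toy case the macro average is larger than the plain MSE because the single mispredicted "high" molecule counts as a full class instead of one sample in seven, which is the behavior one usually wants on an imbalanced ordinal data set.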

Abstract

Predicting olfactory perception directly from molecular structure is central to fragrance design, which plays a role in a wide range of industries, such as perfumery, food and beverage, and health care. Among olfactory attributes, odor strength is a key factor in shaping odor perception, but its modeling has been impeded by scarce and fragmented intensity data. In this work, we introduce an ordinal odor strength data set of over 2,000 molecules by integrating two different public sources, mapping structures to odorless, low, medium, and high categories. Across several molecular encodings and supervised learning algorithms, we compared different prediction strategies. Dimensionality reduction and SHAP analysis identify molecular size, polarity, ring features, and branching as primary drivers, consistent with mass-transport constraints on volatility, sorption, and receptor access. This scalable ordinal framework enables reliable odor-strength estimation for novel molecules and provides a foundation for in silico fragrance design.

Paper Structure

This paper contains 5 sections, 1 equation, 4 figures.

Figures (4)

  • Figure 1: Schematic representation of the mass-transport mechanism for molecules to be olfactory stimuli. The odorant has to evaporate, enter the nose, reach the olfactory epithelium, adsorb into the olfactory mucosa, enter olfactory receptor binding pockets, and activate an olfactory receptor neuron. Thus, the chemical space of potential odorous compounds is restricted by volatility and polarity constraints.
  • Figure 2: Overview of the developed method. A data set of the ordinal odor strength of more than 2000 compounds was compiled from Good Scents and PubChem. Ordinal regression with a range of state-of-the-art machine learning algorithms was performed to categorize substances by their odor strength.
  • Figure 3: Data set representations. (a) Number of data instances for each odor strength. Each square corresponds to about 40 data instances. For each odor strength category, an example molecule is shown. The underlying values are given in Table S1 of the SI. (b) 2D PCA of the RDKit descriptors of our curated data set, colored by odor strength, together with the odorous background data set (grey). The background consists of 52457 molecules obtained by downsampling the GDB-17 database [ruddigkeit_2012] and keeping molecules with a predicted odor probability of 50% or more according to the best-performing model from Mayhew et al. [mayhew_2022].
  • Figure 4: Best-performing model performance for the direct prediction approach. (a) Macro-averaged mean squared error (MSE) across odor strength categories in the validation sets obtained from cross-validation of the best models for all combinations of molecule encoders (bottom) and predictors (left). MLP denotes multilayer perceptron and FP denotes fingerprint. (b) Confusion matrix normalized by the number of test samples for the best-performing model on the test set, averaged over 10 random-seeded training runs. (c) Area-normalized violin plots of the best-performing model predictions for novel molecules from Keller et al. [keller_2017] compared with their experimentally rated odor intensities (from 0 to 100; 13-108 per molecule) at 1/1000 dilution. (d) Global SHAP (SHapley Additive exPlanations) [lundberg2017unified] feature importance of the most influential feature groups of the best-performing model. The RDKit descriptor features were grouped using agglomerative clustering based on their feature value correlation (threshold: 0.75). The absolute SHAP values within each group were summed.
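
The feature grouping described for Figure 4(d) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses single-linkage agglomerative clustering (equivalent to merging any two features whose absolute Pearson correlation reaches the threshold), and it takes a feature's global importance to be the mean absolute SHAP value over samples before summing within a group. The caption specifies neither the linkage nor the per-feature aggregation, and the descriptor names and all numbers below are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def group_features(columns, threshold=0.75):
    """Single-linkage grouping: merge features with |r| >= threshold."""
    names = list(columns)
    parent = {name: name for name in names}

    def find(name):
        # Union-find root lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(columns[a], columns[b])) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())


def grouped_importance(groups, shap_values):
    """Sum the mean |SHAP| of every feature inside each group."""
    def mean_abs(values):
        return sum(abs(v) for v in values) / len(values)

    return {" + ".join(g): sum(mean_abs(shap_values[f]) for f in g)
            for g in groups}


# Hypothetical descriptor columns (one value per molecule).
columns = {
    "mol_weight": [100, 150, 200, 250, 300],
    "heavy_atoms": [7, 11, 14, 18, 21],
    "tpsa": [20, 60, 10, 50, 30],
    "ring_count": [1, 0, 0, 2, 1],
}
# Hypothetical per-molecule SHAP values for the same features.
shap_values = {
    "mol_weight": [0.2, -0.1, 0.3, 0.1, -0.3],
    "heavy_atoms": [0.1, -0.1, 0.1, 0.0, -0.2],
    "tpsa": [-0.4, 0.2, -0.3, 0.1, -0.5],
    "ring_count": [0.05, 0.0, -0.1, 0.05, 0.0],
}

groups = group_features(columns, threshold=0.75)
importance = grouped_importance(groups, shap_values)
for name, value in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.3f}")
```

Grouping before summing avoids splitting the credit for one underlying property (here, molecular size) across several nearly redundant descriptors, which would otherwise make each of them look less important than the property actually is.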