From Molecules to Mixtures: Learning Representations of Olfactory Mixture Similarity using Inductive Biases

Gary Tom, Cher Tian Ser, Ella M. Rajaonson, Stanley Lo, Hyun Suk Park, Brian K. Lee, Benjamin Sanchez-Lengeling

TL;DR

POMMix extends the Principal Odor Map to olfactory mixtures by combining a mono-molecular GraphNets-based representation with a permutation-invariant mixture attention module and a distance-based prediction head. The two-stage training regime leverages a mono-molecular pretraining phase and a mixture-focused fine-tuning phase, achieving state-of-the-art predictive performance on mixture similarity across public datasets and demonstrating generalization to unseen molecules and larger mixture sizes. The work highlights the power of incorporating domain-specific inductive biases in low-data olfactory domains, offers interpretable insights via attention maps, and provides fully open data and code to drive reproducibility. Collectively, POMMix advances the digitization of olfaction and has potential implications for fragrance design, flavor science, and related multi-component sensing tasks.

Abstract

Olfaction -- how molecules are perceived as odors to humans -- remains poorly understood. Recently, the principal odor map (POM) was introduced to digitize the olfactory properties of single compounds. However, smells in real life are not pure single molecules, but complex mixtures of molecules, whose representations remain relatively under-explored. In this work, we introduce POMMix, an extension of the POM to represent mixtures. Our representation builds upon the symmetries of the problem space in a hierarchical manner: (1) graph neural networks for building molecular embeddings, (2) attention mechanisms for aggregating molecular representations into mixture representations, and (3) cosine prediction heads to encode olfactory perceptual distance in the mixture embedding space. POMMix achieves state-of-the-art predictive performance across multiple datasets. We also evaluate the generalizability of the representation on multiple splits when applied to unseen molecules and mixture sizes. Our work advances the effort to digitize olfaction, and highlights the synergy of domain expertise and deep learning in crafting expressive representations in low-data regimes.
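
The hierarchy described in the abstract can be illustrated with a small sketch of steps (2) and (3): pooling a set of molecule embeddings into one mixture embedding with attention, then comparing two mixtures by cosine distance. This is a minimal illustration under stated assumptions, not the authors' implementation: the single fixed `query` vector, the function names, and the division by 2 that maps the distance to [0, 1] are all hypothetical, and step (1), the graph neural network, is replaced by hand-written embeddings.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(molecule_embeddings, query):
    """Pool a set of molecule embeddings into one mixture embedding.

    Each molecule is scored against a query vector (assumed here to be a
    fixed, hypothetical parameter), and the weighted sum is returned.
    Because the weights depend only on each molecule's own score, the
    result does not depend on the order of the molecules.
    """
    dim = len(query)
    scores = [
        sum(q * x for q, x in zip(query, emb)) / math.sqrt(dim)
        for emb in molecule_embeddings
    ]
    weights = softmax(scores)
    return [
        sum(w * emb[i] for w, emb in zip(weights, molecule_embeddings))
        for i in range(dim)
    ]


def scaled_cosine_distance(u, v):
    """Cosine distance rescaled from [0, 2] to [0, 1].

    The rescaling by 1/2 is an assumption made for this sketch; the
    scaling used in the paper's predictor head is not reproduced here.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return (1.0 - dot / (norm_u * norm_v)) / 2.0


# Toy 2-dimensional "molecule embeddings" standing in for GNN outputs.
mixture_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
mixture_b = [[0.9, 0.1], [0.2, 0.8]]
query = [0.3, -0.2]

embedding_a = attention_pool(mixture_a, query)
embedding_b = attention_pool(mixture_b, query)
predicted_distance = scaled_cosine_distance(embedding_a, embedding_b)
```

Reordering the molecules in `mixture_a` leaves `embedding_a` unchanged up to floating-point rounding, which is the permutation symmetry the mixture representation is meant to respect.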

Paper Structure

This paper contains 25 sections, 1 equation, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Task schematic. Data collection process for olfactory mixture similarities (left), and our approach to predicting olfactory mixture similarities (right).
  • Figure 2: Snitz, Ravia, and Bushdid mixture datasets at a glance. a) Most mixtures contain 4-30 molecules, with a handful of single-molecule data as a measurement baseline. b) Most mixtures are somewhat different (0.4-0.8 averaged human response), with a smaller number of outright dissimilar measurements. c) Standard RDKit cheminformatics molecule features, aggregated across the mixture with mean, standard deviation, minimum, and maximum (as described in Soelch2019-qk, Corso2020-ei), correlate poorly with perceptual similarity, while d) POMMix embeddings are carefully tuned for the task of discriminating mixture percepts. Pearson $\rho$ correlation coefficients are annotated in the insets. Across all four subplots, color labels indicate the dataset source.
  • Figure 3: The POMMix model combines POM with mixture modeling. (Top) The POM model with a generalized linear model (GLM) is pre-trained with mono-molecular olfactory data, and mixture modeling is performed through the CheMix attention model. (Middle) The two modules are joined to produce mixture embeddings which are trained to encode the olfactory perceptual distance of two mixtures using a scaled cosine distance predictor head. (Bottom) A multi-step model fitting procedure is used, where certain model weights are updated (flame) and other pre-trained model weights are frozen (snowflake).
  • Figure 4: Model performances on mixture dataset. Pearson $\rho$, RMSE, and Kendall $\tau$ for all baselines and models evaluated. Model complexity increases from top to bottom. Parity plots are available in Appendix \ref{sec:predictive-parity}.
  • Figure 5: Generalization to new mixture sizes and molecules. a) Ablation study with training data only containing mixtures with geometric average number of molecules less than a threshold. The thresholds are indicated for each split. b) Boxplot of POMMix test Pearson correlation on random CV splits, and the LMO splits.
  • ...and 7 more figures
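
The baseline in Figure 2c aggregates per-molecule features across a mixture with mean, standard deviation, minimum, and maximum. A minimal sketch of that kind of order-independent aggregation is given below. It assumes the per-molecule feature vectors have already been computed (the numbers here are made up, not RDKit output), the function name `aggregate_mixture` is hypothetical, and the population standard deviation is an arbitrary choice, since the caption does not say which variant was used.

```python
from statistics import mean, pstdev


def aggregate_mixture(features):
    """Concatenate mean, std, min, and max of each feature over a mixture.

    `features` is a list of equal-length per-molecule feature vectors.
    The output has 4 * n_features entries and does not depend on the
    order in which the molecules are listed.
    """
    columns = list(zip(*features))  # one tuple per feature
    aggregated = []
    for statistic in (mean, pstdev, min, max):
        aggregated.extend(float(statistic(column)) for column in columns)
    return aggregated


# Two molecules, two made-up features each.
mixture_features = [[1.0, 10.0], [3.0, 30.0]]
mixture_vector = aggregate_mixture(mixture_features)
```

A fixed-length vector of this kind can be compared between two mixtures regardless of how many molecules each contains, which is what makes it usable as a baseline for mixtures of varying size.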