Benchmark for Assessing Olfactory Perception of Large Language Models

Eftychia Makri, Nikolaos Nakis, Laura Sisson, Gigi Minsky, Leandros Tassiulas, Vahid Satarifard, Nicholas A. Christakis

Abstract

Here we introduce the Olfactory Perception (OP) benchmark, designed to assess the capability of large language models (LLMs) to reason about smell. The benchmark contains 1,010 questions across eight task categories spanning odor classification, odor primary-descriptor identification, intensity and pleasantness judgments, multi-descriptor prediction, mixture similarity, olfactory receptor activation, and smell identification from real-world odor sources. Each question is presented in two prompt formats, compound names and isomeric SMILES, to evaluate the effect of molecular representation. Evaluating 21 model configurations across major model families, we find that compound-name prompts consistently outperform isomeric SMILES, with gains ranging from +2.4 to +18.9 percentage points (mean of approximately +7 points), suggesting that current LLMs access olfactory knowledge primarily through lexical associations rather than structural molecular reasoning. The best-performing model reaches 64.4% overall accuracy, highlighting both emerging capabilities and substantial remaining gaps in olfactory reasoning. We further evaluate a subset of the OP benchmark across 21 languages and find that aggregating predictions across languages improves olfactory prediction, with the best-performing language-ensemble model reaching AUROC = 0.86. These results suggest that LLMs should be able to handle olfactory, and not just visual or auditory, information.

Paper Structure

This paper contains 42 sections, 10 figures, and 4 tables.

Figures (10)

  • Figure 1: Olfactory Perception (OP) benchmark task description and output format across tasks.
  • Figure 2: Olfactory Perception (OP) benchmark performance. (a) Two bars per model show overall mean accuracy across all categories for isomeric SMILES and compound-name prompts. Multi‑answer categories use multilabel F1; all others use any‑overlap accuracy. Error bars indicate 95% confidence intervals of the cross‑category mean. Models are grouped by LLM family with shaded bands and dashed boundaries; labels use shortened model names. (b) Eight subplots show per‑category accuracy for each model, with separate isomeric SMILES and compound‑name prompts. Horizontal error bars are bootstrap standard deviations of the mean accuracy. Each panel includes a category‑specific chance baseline shown by a gray dashed line.
  • Figure 3: Prediction correlations and performance difficulty across olfactory tasks. (a) Pearson correlations across models for three continuous‑rating categories (odor intensity, odor pleasantness, and mixture similarity). Each row shows the distribution of model correlations for isomeric SMILES and compound-name prompts; gray points represent individual models, and the best‑performing model per prompt/category is highlighted with a colored circle. State‑of‑the‑art reference performance is indicated by red dashed lines. (b) Combined label‑difficulty ranking for RATA (blue) and olfactory‑receptor activation (red). For each label, per‑model F1 scores are computed in a multilabel setting; the bar shows the mean F1 across models with an error bar (standard deviation), while points represent individual model values.
  • Figure 4: Multilingual RATA performance for the compound-name prompt. (a) Per-language distributions of model performance, summarized as mean per-question multilabel F1; colored points denote individual models and the black star denotes a cross-model ensemble. (b) DeepSeek (32K) AUROC by language family, using per-label vote fractions across the languages within each family (the East Asian and Others groups are not single language families); the black curve represents an ensemble of all languages. (c) AUROC for each model using an all-language majority-vote ensemble, plus a cross-model ensemble pooling all models from all languages. The dashed diagonal indicates chance.
  • Figure 5: Question difficulty distribution. Each ridge shows the density of questions at a given difficulty level, measured as the percentage of the 21 evaluated models answering correctly (compound-name prompts); ridge heights are proportional to the number of questions per category, and annotated numbers indicate bin counts. OC clusters at the right (easy), OS at the left (hard), and OIn/OPl exhibit bimodal distributions with mass at both extremes.
  • ...and 5 more figures
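
The Figure 2 caption names two scoring rules: any-overlap accuracy for single-answer categories and multilabel F1 for multi-answer categories. A minimal sketch of both, assuming (this is not the authors' released code) that predictions and gold answers are represented as sets of descriptor strings:

```python
# Hedged sketch of the two Figure 2 scoring rules; the set-of-strings
# answer format is an assumption, not taken from the paper.

def any_overlap_accuracy(predicted: set, gold: set) -> float:
    """Score 1 if the prediction shares at least one label with the gold set."""
    return 1.0 if predicted & gold else 0.0

def multilabel_f1(predicted: set, gold: set) -> float:
    """Per-question F1 over label sets, as used for multi-answer categories."""
    tp = len(predicted & gold)  # true positives: labels in both sets
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Under this reading, a partially correct multi-descriptor answer (e.g. predicting {"fruity", "sweet"} against gold {"sweet", "floral"}) earns partial F1 credit (0.5), whereas the same overlap would count as fully correct under any-overlap accuracy.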
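
The Figure 4 caption describes aggregating predictions across languages via per-label vote fractions and an all-language majority vote. A minimal sketch of that aggregation, under the assumption (again, not the paper's code) that each language contributes one set of predicted labels per question:

```python
# Hedged sketch of the cross-language ensembling described in Figure 4.
# The data layout (one label set per language) is an assumption.
from collections import Counter

def vote_fractions(per_language_predictions, label_space):
    """For each label, return the fraction of languages that predicted it.
    These fractions can serve as continuous scores for AUROC against gold labels."""
    counts = Counter()
    for labels in per_language_predictions:
        counts.update(labels)
    n = len(per_language_predictions)
    return {label: counts[label] / n for label in label_space}

def majority_vote(per_language_predictions, label_space, threshold=0.5):
    """All-language majority vote: keep labels predicted by more than
    `threshold` of the languages."""
    fracs = vote_fractions(per_language_predictions, label_space)
    return {label for label, frac in fracs.items() if frac > threshold}
```

Thresholding the vote fractions yields the discrete majority-vote ensemble of panel (c), while using the fractions directly as scores supports the AUROC curves of panel (b).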