New York Smells: A Large Multimodal Dataset for Olfaction

Ege Ozguroglu, Junbang Liang, Ruoshi Liu, Mia Chiquier, Michael DeTienne, Wesley Wei Qian, Alexandra Horowitz, Andrew Owens, Carl Vondrick

TL;DR

New York Smells tackles the lack of natural multimodal olfactory data by collecting a large, in-the-wild dataset pairing 7K smell–image samples with rich sensor readings in NYC. The authors train a contrastive, multimodal representation (COIP) to align smell and vision using two input signals (raw $T\times32$ e-nose data and a 32‑D smellprint) and evaluate on cross-modal retrieval, smell-based scene/object/material recognition, and fine-grained grass classification. Results show vision supervision enables robust olfactory representations and that end-to-end learning on raw olfactory signals outperforms traditional hand-crafted features, enabling meaningful cross-modal and semantic understanding of odors in real-world settings. The work advances computational olfaction and opens directions for in-the-wild, cross-modal sensing with practical implications for environment understanding and scent-based applications.
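The contrastive alignment of smell and vision mentioned above can be illustrated with a symmetric InfoNCE-style loss over a batch of paired embeddings. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the temperature value, and the use of plain Python lists are all hypothetical, and the summary does not specify the exact loss used in the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def symmetric_contrastive_loss(smell_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE over paired embeddings (hypothetical helper).

    smell_emb[i] and image_emb[i] are assumed to come from the same sample;
    every other pairing in the batch is treated as a negative.
    """
    s = [l2_normalize(v) for v in smell_emb]
    m = [l2_normalize(v) for v in image_emb]
    n = len(s)
    # Temperature-scaled cosine similarities: rows index smells, columns index images.
    logits = [
        [sum(a * b for a, b in zip(s[i], m[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_on_diagonal(rows):
        total = 0.0
        for i, row in enumerate(rows):
            mx = max(row)  # subtract the max for a stable log-sum-exp
            lse = mx + math.log(sum(math.exp(x - mx) for x in row))
            total += lse - row[i]
        return total / len(rows)

    columns = [list(col) for col in zip(*logits)]
    # Average the smell-to-image and image-to-smell directions.
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(columns))
```

The loss is small when each smell embedding is closest to its own image embedding and grows when the pairing is scrambled, which is the property a cross-modal objective of this kind relies on.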

Abstract

While olfaction is central to how animals perceive the world, this rich chemical sensory modality remains largely inaccessible to machines. One key bottleneck is the lack of diverse, multimodal olfactory training data collected in natural settings. We present New York Smells, a large dataset of paired image and olfactory signals captured "in the wild." Our dataset contains 7,000 smell-image pairs from 3,500 distinct objects across indoor and outdoor environments, with approximately 70$\times$ more objects than existing olfactory datasets. Our benchmark has three tasks: cross-modal smell-to-image retrieval, recognizing scenes, objects, and materials from smell alone, and fine-grained discrimination between grass species. Through experiments on our dataset, we find that visual data enables cross-modal olfactory representation learning, and that our learned olfactory representations outperform widely-used hand-crafted features.
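To make the smell-to-image retrieval task concrete, the sketch below ranks a gallery of image embeddings by cosine similarity to a smell embedding and scores the result with recall@k. The helper names and the choice of recall@k as the metric are assumptions for illustration; the abstract does not state which retrieval metric the benchmark reports.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query, gallery, k=1):
    """Return indices of the k gallery embeddings most similar to the query."""
    scores = [cosine(query, g) for g in gallery]
    order = sorted(range(len(gallery)), key=lambda j: scores[j], reverse=True)
    return order[:k]


def recall_at_k(smell_emb, image_emb, k=1):
    """Fraction of smell queries whose paired image (same index) is in the top k."""
    hits = sum(
        1 for i, query in enumerate(smell_emb) if i in retrieve(query, image_emb, k)
    )
    return hits / len(smell_emb)
```

In use, `smell_emb` and `image_emb` would be the outputs of the trained smell and image encoders on a held-out set, with matching indices denoting paired samples.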

Paper Structure

This paper contains 31 sections, 3 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Multimodal olfaction in-the-wild. (a) We present New York Smells: a diverse, multimodal dataset of natural olfactory signals and paired visual data. We show one sequence of images and smell signals that we obtained in a public park (one scene of many in our dataset). We use this dataset for in-the-wild multimodal olfactory learning tasks that were not possible with previous datasets: (b) learning cross-modal features between olfaction and images, (c) retrieving images based on their corresponding olfactory signals, (d) recognizing in-the-wild scene, object, and material categories from smell, (e) distinguishing different grass species.
  • Figure 2: The New York Smells dataset. We collect a diverse dataset of paired sight and olfaction by visiting many locations within New York City and recording a variety of materials (top rows) and objects (bottom rows) in different scenes. We show a selection of the captured images here. All samples have a corresponding olfactory signal captured from the Cyranose electronic nose.
  • Figure 3: Odorant analysis. We show the distribution of objects and materials in our dataset. We use these labels to define smell understanding benchmarks.
  • Figure 4: Capturing paired sight and olfaction. We walk through a variety of real-world scenes and capture paired olfaction and visual signals using a camera mounted to an e-nose on a custom 3D-printed sensor rig. We point the e-nose's snout at each object or substance of interest and record multiple images and smell signals from different orientations. We also capture a suite of other supplementary modalities: depth (from an RGB-D camera), temperature, humidity, and ambient VOC concentrations (from a PID sensor).
  • Figure 5: Olfactory signal. The raw smell signal has dimensions $T \times 32$, where $T$ is the capture time. The first part of the capture is the baseline phase, where the ambient background smell is sensed. The second part is the sample phase, where the smell of the object of interest is sensed. This example shows the response for a flower.
  • ...and 3 more figures
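The two-phase structure described in the Figure 5 caption suggests a simple way to reduce a raw $T \times 32$ recording to one value per channel: compare the sample phase against the baseline phase. The sketch below computes a peak relative change per channel. It is one plausible hand-crafted summary and is not claimed to be the 32-D smellprint used in the paper; the function name, the `baseline_len` split index, and the relative-change formula are assumptions.

```python
def relative_response(signal, baseline_len):
    """Summarize a T x C recording as one relative-change value per channel.

    signal: list of T rows, each holding one reading per sensor channel.
    baseline_len: number of leading rows assumed to belong to the baseline
        phase (hypothetical parameter; the real phase boundary depends on
        the capture protocol).
    """
    baseline, sample = signal[:baseline_len], signal[baseline_len:]
    if not baseline or not sample:
        raise ValueError("both baseline and sample phases must be non-empty")

    features = []
    for c in range(len(signal[0])):
        # Mean ambient reading for this channel (assumed non-zero).
        r0 = sum(row[c] for row in baseline) / len(baseline)
        # Sample-phase reading that deviates most from the ambient level.
        peak = max((row[c] for row in sample), key=lambda r: abs(r - r0))
        features.append((peak - r0) / r0)
    return features
```

A learned encoder operating directly on the full $T \times 32$ array sees the whole response curve, whereas a summary like this keeps only one number per channel.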