Learning to Taste: A Multimodal Wine Dataset

Thoranna Bender, Simon Moe Sørensen, Alireza Kashani, K. Eldjarn Hjorleifsson, Grethe Hyldig, Søren Hauberg, Serge Belongie, Frederik Warburg

TL;DR

WineSensed tackles the problem of grounding flavor in multimodal representations by combining wine label images, user reviews, and human flavor annotations. The authors introduce FEAST, a framework that aligns CLIP-based embeddings with human flavor similarities via non-metric multidimensional scaling (NMDS) and canonical correlation analysis (CCA) to create a low-dimensional flavor space. On both coarse attribute prediction and fine-grained taste-space alignment, multimodal inputs augmented with flavor annotations perform best and align most closely with human perception. The dataset and method offer a resource for flavor-grounded foundation models and point to future expansion into broader wine types and additional modalities.

Abstract

We present WineSensed, a large multimodal wine dataset for studying the relations between visual perception, language, and flavor. The dataset encompasses 897k images of wine labels and 824k reviews of wines curated from the Vivino platform. It has over 350k unique bottlings, annotated with year, region, rating, alcohol percentage, price, and grape composition. We obtained fine-grained flavor annotations on a subset by conducting a wine-tasting experiment with 256 participants who were asked to rank wines based on their similarity in flavor, resulting in more than 5k pairwise flavor distances. We propose a low-dimensional concept embedding algorithm that combines human experience with automatic machine similarity kernels. We demonstrate that this shared concept embedding space improves upon separate embedding spaces for coarse flavor classification (alcohol percentage, country, grape, price, rating) and aligns with the intricate human perception of flavor.
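
For readers unfamiliar with the two building blocks named in the TL;DR, the following is a sketch of the standard textbook objectives for NMDS and CCA. It is not taken from the paper, whose exact formulation may differ. All symbols are assumptions introduced here: \delta_{ij} is the human flavor dissimilarity between wines i and j, x_i is the low-dimensional coordinate of wine i, f is a monotone function fitted by NMDS, and U and V are the machine embeddings and the NMDS coordinates.

```latex
% NMDS: choose coordinates x_i whose pairwise distances preserve the
% rank order of the dissimilarities \delta_{ij} (Kruskal's stress-1).
\[
  \mathrm{Stress}(x_1,\dots,x_n)
  = \sqrt{\frac{\sum_{i<j}\bigl(f(\delta_{ij}) - \lVert x_i - x_j\rVert\bigr)^2}
               {\sum_{i<j}\lVert x_i - x_j\rVert^2}},
  \qquad f \text{ monotone non-decreasing.}
\]

% CCA: find projection directions a, b that maximize the correlation
% between the projected machine embeddings Ua and flavor coordinates Vb.
\[
  \max_{a,\,b}\;
  \frac{a^{\top}\Sigma_{UV}\,b}
       {\sqrt{a^{\top}\Sigma_{UU}\,a}\;\sqrt{b^{\top}\Sigma_{VV}\,b}}
\]
```

Because NMDS depends only on the rank order of the dissimilarities, it suits human similarity judgments whose absolute scale is unreliable. CCA then supplies the linear maps that place both views in a shared space.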

Paper Structure (26 sections, 9 figures, 20 tables)

Figures (9)

  • Figure 1: Flavor as an additional data modality. The WineSensed dataset consists of a large collection of images, user reviews, and metadata about unique bottlings (upper left). In a large user study, we collected flavor annotations of over 100 wines using the "Napping" method (Pagès, 2005), where participants were asked to place wines on a sheet of paper based on their perceived taste similarity (lower left). We propose an algorithm to combine these data modalities into a shared representation (right) and find that using taste annotations as an additional modality improves performance in downstream tasks.
  • Figure 2: Examples from WineSensed. The dataset consists of images of wine labels, user-generated reviews, per-wine attributes (country, grape, region, alcohol percentage, rating, price), and flavor annotations. Here are examples of the images, reviews, and attributes.
  • Figure 3: Examples of images. The viewpoint, lighting, and composition vary across images.
  • Figure 4: Summary statistics of user reviews and images. Most unique bottlings have fewer than 10 images. The average review length is 16 words. Common keywords in the reviews include 'fruit', 'dry', and 'smooth', which reveal coarse semantic information about the flavor of the wines, while other keywords such as 'good' and 'great' carry no flavor information.
  • Figure 5: Wine attributes. WineSensed contains attributes about the geolocation of production (country, region) and the grape composition of each wine. Furthermore, the dataset includes information on the average price of the wine, alcohol percentage, average rating on the Vivino platform, and the year of production. The histograms show the distribution of these attributes.
  • ...and 4 more figures
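
The "Napping" procedure in the Figure 1 caption turns sheet placements into pairwise flavor distances. A minimal sketch of that idea follows, using only the Python standard library. The wine identifiers, coordinates, per-participant normalization, and averaging step are all assumptions made for illustration, not the authors' documented processing.

```python
import math
from itertools import combinations


def napping_distances(placements):
    """Euclidean distance between every pair of wines on one participant's sheet.

    placements: dict mapping a wine id to its (x, y) position on the paper.
    """
    return {
        (a, b): math.dist(placements[a], placements[b])
        for a, b in combinations(sorted(placements), 2)
    }


def normalize(distances):
    """Divide by the largest distance so sheets with different spreads are comparable.

    This normalization is an assumption of the sketch.
    """
    longest = max(distances.values())
    if longest == 0:
        return {pair: 0.0 for pair in distances}
    return {pair: d / longest for pair, d in distances.items()}


def average_across_participants(sheets):
    """Average the normalized distance of each wine pair over all sheets containing it."""
    totals, counts = {}, {}
    for placements in sheets:
        for pair, d in normalize(napping_distances(placements)).items():
            totals[pair] = totals.get(pair, 0.0) + d
            counts[pair] = counts.get(pair, 0) + 1
    return {pair: totals[pair] / counts[pair] for pair in totals}


# Hypothetical sheets from two participants (coordinates in arbitrary units).
sheets = [
    {"wine_a": (0.0, 0.0), "wine_b": (3.0, 4.0), "wine_c": (6.0, 8.0)},
    {"wine_a": (1.0, 1.0), "wine_b": (1.0, 3.0)},
]
flavor_distances = average_across_participants(sheets)
print(flavor_distances)
```

Only pairs that a participant actually placed on the same sheet contribute, so the resulting distance matrix is sparse. An embedding method would need to tolerate those missing entries.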