SommBench: Assessing Sommelier Expertise of Language Models

William Brach; Tomas Bedej; Jacob Nielsen; Jacob Pichna; Juraj Bedej; Eemeli Saarensilta; Julie Dupouy; Gianluca Barmina; Andrea Blasi Núñez; Peter Schneider-Kamp; Kristian Košťál; Michal Ries; Lukas Galke Poech

Abstract

With the rapid advances of large language models, it becomes increasingly important to systematically evaluate their multilingual and multicultural capabilities. Previous cultural evaluation benchmarks focus mainly on basic cultural knowledge that can be encoded in linguistic form. Here, we propose SommBench, a multilingual benchmark to assess sommelier expertise, a domain deeply grounded in the senses of smell and taste. Because language models learn about sensory properties exclusively through textual descriptions, SommBench tests whether this textual grounding is sufficient to emulate expert-level sensory judgment. SommBench comprises three main tasks: Wine Theory Question Answering (WTQA), Wine Feature Completion (WFC), and Food-Wine Pairing (FWP). SommBench is available in eight languages: English, Slovak, Swedish, Finnish, German, Danish, Italian, and Spanish. This multilingual design helps separate a language model's wine expertise from its language skills. The benchmark datasets were developed in close collaboration with a professional sommelier and native speakers of the respective languages, resulting in 1,024 wine theory questions, 1,000 wine feature-completion examples, and 1,000 food-wine pairing examples. We provide results for the most popular language models, including closed-weights models such as Gemini 2.5 and open-weights models such as GPT-OSS and Qwen 3. Our results show that the most capable models perform well on wine theory question answering (up to 97% correct with a closed-weights model), yet feature completion (peaking at 65%) and food-wine pairing (MCC ranging between 0 and 0.39) turn out to be more challenging. These results position SommBench as an interesting and challenging benchmark for evaluating the sommelier expertise of language models. The benchmark is publicly available at https://github.com/sommify/sommbench.
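To make the reported MCC range easier to interpret, the following minimal sketch computes the Matthews correlation coefficient next to plain accuracy. It assumes a binary labelling (1 = suitable pairing, 0 = unsuitable), which the abstract does not specify; the function names and toy labels are hypothetical and are not taken from the SommBench repository.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient in [-1, 1]; 0 means no better than chance."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: a degenerate confusion matrix yields 0.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy labels (assumed for illustration only): 1 = suitable pairing, 0 = unsuitable.
reference = [1, 1, 0, 0]
always_yes = [1, 1, 1, 1]   # a model that accepts every pairing
mostly_right = [1, 0, 0, 0]  # a model that misses one suitable pairing

print(accuracy(reference, always_yes), mcc(reference, always_yes))
print(accuracy(reference, mostly_right), mcc(reference, mostly_right))
```

In this toy case, a model that accepts every pairing still reaches 50% accuracy but an MCC of 0, which is why an MCC near 0 indicates chance-level pairing judgment even when accuracy looks moderate.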

Paper Structure (45 sections, 1 equation, 3 figures, 4 tables)

Introduction
Related work
SommBench
Wine Theory Question-Answering
Data Collection
Data Curation
Multilinguality
The Task
Evaluation
Wine Features Completion
Data Collection
Data Curation
Multilinguality
The Task
Evaluation
...and 30 more sections

Figures (3)

Figure 1: Performance of leading open- and closed-source language models on the SommBench benchmark. The radar chart shows model accuracy (higher is better) on the SommBench tasks, revealing key differences in competencies across tasks and models.
Figure 2: Taxonomy of sommelier expertise, categorized into knowledge, profiling, and food-wine pairing.
Figure 3: Scaling behaviour. SommBench score (y-axis) against number of parameters in billions (x-axis). gemini-2.5-flash and gemini-2.5-pro are plotted without a specific x-coordinate, as their parameter counts are undisclosed; they are included as performance reference points only.