FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture

Wenyan Li, Xinyu Zhang, Jiaang Li, Qiwei Peng, Raphael Tang, Li Zhou, Weijia Zhang, Guimin Hu, Yifei Yuan, Anders Søgaard, Daniel Hershcovich, Desmond Elliott

TL;DR

This work introduces FoodieQA, a manually curated, fine-grained image-text dataset capturing the intricate features of food cultures across various regions in China, and evaluates vision-language models (VLMs) and large language models (LLMs) on newly collected, unseen food images and corresponding questions.

Abstract

Food is a rich and varied dimension of cultural heritage, crucial to both individuals and social groups. To bridge the gap in the literature on the often-overlooked regional diversity in this domain, we introduce FoodieQA, a manually curated, fine-grained image-text dataset capturing the intricate features of food cultures across various regions in China. We evaluate vision-language models (VLMs) and large language models (LLMs) on newly collected, unseen food images and corresponding questions. FoodieQA comprises three multiple-choice question-answering tasks in which models must answer questions based on multiple images, a single image, and text-only descriptions, respectively. While LLMs excel at text-based question answering, surpassing human accuracy, the open-sourced VLMs still fall short by 41% on multi-image and 21% on single-image VQA tasks, although closed-weights models perform closer to human levels (within 10%). Our findings highlight that understanding food and its cultural implications remains a challenging and under-explored direction.

Paper Structure (38 sections, 15 figures, 12 tables)

Figures (15)

  • Figure 1: An example of regional differences in what "hotpot" refers to in China. The depicted soups and dishware visually reflect the ingredients, flavors, and traditions of these regions: Beijing in the north, Sichuan in the southwest, and Guangdong on the south coast.
  • Figure 2: The tasks in FoodieQA evaluate food culture understanding from three perspectives. Multi-image VQA requires the ability to compare multiple images, similar to how humans browse a restaurant menu. Single-image VQA assesses whether models can use visual information to better understand food culture. Text-based questions probe model performance without multimodal data. Fine-grained attributes that the questions focus on are highlighted.
  • Figure 3: Geographical distribution of cuisine types.
  • Figure 4: Meta-info annotation for local specialty.
  • Figure 5: Region distribution of collected food images.
  • ...and 10 more figures