FMiFood: Multi-modal Contrastive Learning for Food Image Classification

Xinyue Pan, Jiangpeng He, Fengqing Zhu

TL;DR

FMiFood tackles intra-class diversity and inter-class similarity in food image classification by combining multi-modal contrastive learning with flexible image patch–text token matching and GPT-4-augmented text descriptions. The method pairs a soft cross-entropy contrastive objective with a dedicated image-category loss in a unified multi-task framework, and enriches the category text descriptions with GPT-4 to improve discriminability. On UPMC-Food101 and VFN it reports higher accuracy than strong baselines, and ablations support the benefits of flexible matching, the auxiliary classification objective, and the enriched text descriptions. The approach improves image-text alignment in a fine-grained, domain-specific setting and points to reducing noise in token-level matches as a way to further improve robustness.

Abstract

Food image classification is the fundamental step in image-based dietary assessment, which aims to estimate participants' nutrient intake from eating occasion images. A common challenge with food images is intra-class diversity and inter-class similarity, which can significantly hinder classification performance. To address this issue, we introduce a novel multi-modal contrastive learning framework called FMiFood, which learns more discriminative features by integrating additional contextual information, such as food category text descriptions, to enhance classification accuracy. Specifically, we propose a flexible matching technique that improves the similarity matching between text and image embeddings so that it can focus on multiple pieces of key information. Furthermore, we incorporate classification objectives into the framework and explore the use of GPT-4 to enrich the text descriptions and provide more detailed context. Our method demonstrates improved performance on both the UPMC-Food101 and VFN datasets compared to existing methods.
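
The soft cross-entropy contrastive objective mentioned in the TL;DR can be illustrated with a small sketch. When several images in a batch belong to the same food category, a one-hot contrastive target treats same-category pairs as negatives. One common remedy, assumed here and not confirmed as the paper's exact formulation, is to spread each row's target evenly over all batch items that share its label. The function names and the toy similarity matrix below are hypothetical.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of similarity scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def soft_targets(labels):
    # Assumption: row i puts equal mass on every batch item with the same label.
    targets = []
    for li in labels:
        matches = [1.0 if lj == li else 0.0 for lj in labels]
        n = sum(matches)
        targets.append([m / n for m in matches])
    return targets

def soft_cross_entropy(sim, targets):
    # Mean over rows of -sum_j t_ij * log softmax(sim_i)_j.
    loss = 0.0
    for row, t in zip(sim, targets):
        p = softmax(row)
        loss -= sum(tj * math.log(pj) for tj, pj in zip(t, p) if tj > 0)
    return loss / len(sim)

# Toy batch: two pizza images and one sushi image, each paired with its category text.
labels = ["pizza", "pizza", "sushi"]
sim = [[5.0, 5.0, 0.0],   # image-to-text similarity scores (made-up values)
       [5.0, 5.0, 0.0],
       [0.0, 0.0, 5.0]]
soft = soft_targets(labels)
soft_loss = soft_cross_entropy(sim, soft)
```

With these targets, the two pizza images are no longer pushed away from each other's text; both rows treat either pizza text as a valid match.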

Paper Structure

This paper contains 14 sections, 14 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Examples of inter-class similarity and intra-class diversity.
  • Figure 2: Overview of our FMiFood model: $|B|$ denotes the batch size and $|C|$ represents the number of text labels in the dataset. Images and text descriptions are fed into the image encoders and text encoders of the FMiFood model to extract image patch and text token or label token features. The similarity score between image-text pairs is computed based on the flexible matching technique to learn with both the contrastive loss and the categorical loss.
  • Figure 3: Limitation of current multi-modal contrastive learning models for image classification: within a single batch, an image cannot be assumed to match only one text.
  • Figure 4: Partial confusion matrices for selected categories from the UPMC-Food101 and VFN datasets, for different methods.
  • Figure 5: Qualitative comparison between FILIP and FMiFood. The words in red are the text tokens matched to the image patch.
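
The captions for Figures 2 and 5 refer to similarity computed between image patches and text tokens. As a point of reference, the sketch below shows the token-wise max matching used by FILIP, the baseline in Figure 5: each patch is scored against its single best-matching token, and the per-patch scores are averaged. FMiFood's flexible matching relaxes this rule, but its exact form is not given in this section, so it is not reproduced here. The helper names and toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    # Cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def max_match_similarity(patches, tokens):
    # Average, over image patches, of the best cosine match among text tokens.
    best = [max(cosine(p, t) for t in tokens) for p in patches]
    return sum(best) / len(best)

def matched_token_indices(patches, tokens):
    # Index of the best-matching text token for each image patch.
    return [max(range(len(tokens)), key=lambda j: cosine(p, tokens[j]))
            for p in patches]

# Toy features: two image patches and three text tokens in a 2-D embedding space.
patches = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
score = max_match_similarity(patches, tokens)
matches = matched_token_indices(patches, tokens)
```

Because each patch keeps only one token, a patch that is relevant to several words in a description contributes through just one of them, which is the kind of restriction a more flexible matching rule is meant to loosen.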