FMiFood: Multi-modal Contrastive Learning for Food Image Classification
Xinyue Pan, Jiangpeng He, Fengqing Zhu
TL;DR
FMiFood tackles intra-class diversity and inter-class similarity in food image classification by combining multi-modal contrastive learning with flexible image patch–text token matching and GPT-4 augmented text descriptions. The method pairs a soft cross-entropy contrastive objective with a dedicated image-category loss in a unified multi-task framework, and enriches the text descriptions with GPT-4 to improve discriminability. Empirical results on UPMC-Food101 and VFN show higher accuracy than strong baselines, and ablations support the benefits of flexible matching, the auxiliary classification objective, and the enriched text descriptions. The approach improves image-text alignment in a fine-grained, domain-specific setting and suggests that reducing noise in token-level matches could further boost robustness.
Abstract
Food image classification is a fundamental step in image-based dietary assessment, which aims to estimate participants' nutrient intake from eating occasion images. A common challenge with food images is intra-class diversity and inter-class similarity, which can significantly hinder classification performance. To address this issue, we introduce a novel multi-modal contrastive learning framework called FMiFood, which learns more discriminative features by integrating additional contextual information, such as food category text descriptions, to enhance classification accuracy. Specifically, we propose a flexible matching technique that improves the similarity matching between text and image embeddings so that the model can focus on multiple pieces of key information. Furthermore, we incorporate classification objectives into the framework and explore the use of GPT-4 to enrich the text descriptions and provide more detailed context. Our method demonstrates improved performance on both the UPMC-Food101 and VFN datasets compared to existing methods.
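
As a rough illustration of how a soft cross-entropy contrastive objective can be combined with an image-category loss, the following minimal sketch may help. It is not the authors' implementation. All function names, the uniform soft-target construction (spreading probability over in-batch pairs that share a food category), the symmetric image-to-text and text-to-image averaging, and the `weight` parameter are assumptions made for illustration only. The flexible patch-token matching that would produce the similarity matrix is not modeled here.

```python
import math

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def soft_cross_entropy(logits, targets):
    # Mean over rows of -sum_j targets[i][j] * log_softmax(logits[i])[j].
    total = 0.0
    for row, tgt in zip(logits, targets):
        logp = log_softmax(row)
        total -= sum(t * lp for t, lp in zip(tgt, logp))
    return total / len(logits)

def soft_targets(labels):
    # Assumption: every in-batch pair with the same category is a positive,
    # and each row spreads its probability mass uniformly over its positives.
    out = []
    for a in labels:
        same = [1.0 if a == b else 0.0 for b in labels]
        count = sum(same)
        out.append([v / count for v in same])
    return out

def multitask_loss(sim, class_logits, labels, weight=1.0):
    # sim[i][j]: similarity between image i and text j (hypothetical input).
    # class_logits[i][c]: classifier score of image i for category c.
    targets = soft_targets(labels)
    sim_t = [list(col) for col in zip(*sim)]  # text-to-image direction
    contrastive = 0.5 * (soft_cross_entropy(sim, targets)
                         + soft_cross_entropy(sim_t, targets))
    n_classes = len(class_logits[0])
    one_hot = [[1.0 if c == y else 0.0 for c in range(n_classes)]
               for y in labels]
    classification = soft_cross_entropy(class_logits, one_hot)
    return contrastive + weight * classification
```

With soft targets, two images of the same category in one batch are not pushed apart as negatives, which is one plausible motivation for a soft rather than one-hot contrastive target when many samples share a class label.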
