OVFoodSeg: Elevating Open-Vocabulary Food Image Segmentation via Image-Informed Textual Representation
Xiongwei Wu, Sicheng Yu, Ee-Peng Lim, Chong-Wah Ngo
TL;DR
OVFoodSeg tackles open-vocabulary food image segmentation by enriching static CLIP text embeddings with image-specific cues through two modules, FoodLearner and the Image-Informed Text Encoder. Training proceeds in two stages: Stage I pre-trains FoodLearner on food image–text pairs with image-text contrastive (ITC), image-text matching (ITM), and language modeling (LM) losses to align visual and textual food representations; Stage II fine-tunes for segmentation using image-informed text embeddings within a SAN-based framework trained with cross-entropy (CE) and Dice losses. OVFoodSeg achieves state-of-the-art performance on FoodSeg103 and FoodSeg195, with mIoU gains on novel classes of $4.9\%$ on FoodSeg103 and $3.5\%$ on FoodSeg195. Ablations highlight the importance of the Stage I losses, the LM objective in particular. The approach reduces the annotation burden and improves generalization to unseen ingredients, which makes open-vocabulary food analysis more practical in real-world applications.
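To make the Stage II objective concrete, the following is a minimal NumPy sketch of a combined cross-entropy and Dice loss for a single image. It is an illustration under stated assumptions, not the authors' implementation: the function names, the equal weighting (`dice_weight=1.0`), the smoothing constant `eps`, and the per-class soft Dice formulation are all hypothetical choices, and the SAN-based framework may define these terms differently.

```python
import numpy as np


def softmax(logits, axis=0):
    """Numerically stable softmax along the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def ce_dice_loss(logits, target, dice_weight=1.0, eps=1e-6):
    """Combined CE + soft Dice loss (illustrative, assumed weighting).

    logits: float array of shape (C, H, W), one score map per class.
    target: int array of shape (H, W) with labels in [0, C).
    """
    num_classes = logits.shape[0]
    probs = softmax(logits, axis=0)

    # One-hot encode the labels: (H, W, C) -> (C, H, W).
    one_hot = np.eye(num_classes)[target].transpose(2, 0, 1)

    # Pixel-wise cross-entropy, averaged over all pixels.
    log_probs = np.log(np.clip(probs, eps, 1.0))
    ce = -np.mean(np.sum(one_hot * log_probs, axis=0))

    # Soft Dice per class, averaged over classes.
    inter = np.sum(probs * one_hot, axis=(1, 2))
    denom = np.sum(probs, axis=(1, 2)) + np.sum(one_hot, axis=(1, 2))
    dice = 1.0 - np.mean((2.0 * inter + eps) / (denom + eps))

    return ce + dice_weight * dice
```

Cross-entropy penalizes each pixel independently, while the Dice term scores region overlap per class, so small ingredient regions are not swamped by large ones. That complementarity is a common reason to pair the two losses in segmentation.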
Abstract
In the realm of food computing, segmenting ingredients from images poses substantial challenges due to the large intra-class variance among the same ingredients, the emergence of new ingredients, and the high annotation costs associated with large food segmentation datasets. Existing approaches primarily operate in a closed-vocabulary setting with static text embeddings. These methods often fall short in handling ingredients effectively, particularly new and diverse ones. In response to these limitations, we introduce OVFoodSeg, a framework that adopts an open-vocabulary setting and enhances text embeddings with visual context. By integrating vision-language models (VLMs), our approach enriches text embeddings with image-specific information through two innovative modules, namely an image-to-text learner, FoodLearner, and an Image-Informed Text Encoder. The training process of OVFoodSeg is divided into two stages: the pre-training of FoodLearner and the subsequent learning phase for segmentation. The pre-training phase equips FoodLearner with the capability to align visual information with corresponding textual representations that are specifically related to food, while the second phase adapts both FoodLearner and the Image-Informed Text Encoder for the segmentation task. By addressing the deficiencies of previous models, OVFoodSeg demonstrates a significant improvement, achieving a 4.9\% increase in mean Intersection over Union (mIoU) on the FoodSeg103 dataset, setting a new milestone for food image segmentation.
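For readers unfamiliar with the metric, the sketch below shows one simple way to compute mean Intersection over Union between a predicted and a ground-truth label map. It is an illustrative simplification rather than the paper's evaluation protocol: the function name `mean_iou` is hypothetical, classes absent from both maps are skipped (an assumed convention), and benchmark mIoU is usually accumulated over an entire dataset rather than computed per image.

```python
import numpy as np


def mean_iou(pred, target, num_classes):
    """Mean IoU over classes present in the prediction or the ground truth.

    pred, target: int arrays of identical shape holding class labels.
    """
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            # Class appears in neither map; skip it (assumed convention).
            continue
        inter = np.logical_and(pred_c, target_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else float("nan")


# Toy 2x2 example with two classes.
pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
score = mean_iou(pred, target, num_classes=2)
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so `score` is their mean, 7/12. In the open-vocabulary setting, such a score is typically reported separately for base and novel classes.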
