OVFoodSeg: Elevating Open-Vocabulary Food Image Segmentation via Image-Informed Textual Representation

Xiongwei Wu, Sicheng Yu, Ee-Peng Lim, Chong-Wah Ngo

TL;DR

OVFoodSeg tackles open-vocabulary food image segmentation by enriching static CLIP text embeddings with image-specific cues through the FoodLearner and the Image-Informed Text Encoder. It introduces a two-stage training regime: Stage I pre-trains FoodLearner on food image–text pairs with ITC, ITM, and LM losses to align visual with textual food representations, and Stage II fine-tunes segmentation using image-informed text embeddings within a SAN-based framework trained with CE and Dice losses. On FoodSeg103 and FoodSeg195, OVFoodSeg achieves state-of-the-art performance, notably $mIoU$ gains of $4.9\%$ on novel classes in FoodSeg103 and $3.5\%$ on FoodSeg195, with ablations highlighting the importance of Stage I losses and the LM objective. This approach reduces annotation burdens and improves generalization to unseen ingredients, offering a practical boost for open-vocabulary food analysis in real-world applications.
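The summary above names cross-entropy (CE) and Dice as the Stage II training losses. The following is a minimal sketch of how such a combined objective could be computed for a single predicted mask. The split between the two terms (CE on the mask's class logits, Dice on the mask's pixels), the weights `w_ce` and `w_dice`, and all function names are assumptions made for illustration; the source does not specify them.

```python
import math

def dice_loss(pred_probs, target, eps=1e-6):
    """Soft Dice loss for one flattened binary mask.

    pred_probs: per-pixel foreground probabilities in [0, 1]
    target:     per-pixel ground-truth labels (0 or 1)
    """
    intersection = sum(p * t for p, t in zip(pred_probs, target))
    denom = sum(pred_probs) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)

def cross_entropy(logits, label):
    """Cross-entropy of one mask's class logits against its true class index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def segmentation_loss(pred_probs, target, class_logits, label,
                      w_ce=1.0, w_dice=1.0):
    """Weighted sum of the two terms (weights are hypothetical)."""
    return (w_ce * cross_entropy(class_logits, label)
            + w_dice * dice_loss(pred_probs, target))
```

A perfectly predicted mask drives the Dice term to zero, while a mask with no overlap pushes it toward one. The CE term depends only on how confidently the correct ingredient class is ranked.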

Abstract

In the realm of food computing, segmenting ingredients from images poses substantial challenges due to the large intra-class variance among the same ingredients, the emergence of new ingredients, and the high annotation costs associated with large food segmentation datasets. Existing approaches primarily operate in a closed-vocabulary setting with static text embeddings. These methods often fall short in handling ingredients effectively, particularly new and diverse ones. In response to these limitations, we introduce OVFoodSeg, a framework that adopts an open-vocabulary setting and enhances text embeddings with visual context. By integrating vision-language models (VLMs), our approach enriches text embeddings with image-specific information through two innovative modules, namely an image-to-text learner, FoodLearner, and an Image-Informed Text Encoder. The training process of OVFoodSeg is divided into two stages: the pre-training of FoodLearner and the subsequent learning phase for segmentation. The pre-training phase equips FoodLearner with the capability to align visual information with corresponding textual representations that are specifically related to food, while the second phase adapts both the FoodLearner and the Image-Informed Text Encoder for the segmentation task. By addressing the deficiencies of previous models, OVFoodSeg demonstrates a significant improvement, achieving a 4.9% increase in mean Intersection over Union (mIoU) on the FoodSeg103 dataset, setting a new milestone for food image segmentation.
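Since the headline result is reported in mean Intersection over Union (mIoU), a small sketch of the metric may help. It assumes flattened per-pixel label lists and averages IoU over a chosen set of class ids, so the same function can be restricted to novel classes only. The function name, the `class_ids` parameter, and the toy labels are hypothetical. Benchmark code typically accumulates intersections and unions over the whole dataset rather than a single image.

```python
def mean_iou(pred, gt, class_ids):
    """Mean IoU over `class_ids` for flattened per-pixel label lists.

    Classes that appear in neither the prediction nor the ground truth
    are skipped so that they do not distort the average.
    """
    ious = []
    for c in class_ids:
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Toy example with three classes over six pixels.
pred = [0, 0, 1, 1, 2, 2]
gt = [0, 1, 1, 1, 2, 0]
all_classes_miou = mean_iou(pred, gt, [0, 1, 2])  # (1/3 + 2/3 + 1/2) / 3 = 0.5
novel_only_miou = mean_iou(pred, gt, [1])         # 2/3, treating class 1 as "novel"
```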

Paper Structure

This paper contains 17 sections, 13 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: (a) Top: The conventional open-vocabulary segmentation framework, Side Adapter Network (SAN) [xu2023san], predicts mask category logits from raw CLIP text embeddings; Bottom: The proposed OVFoodSeg constructs image-informed text embeddings through FoodLearner and the Image-Informed Text Encoder for mask classification. (b) This example illustrates the use of both SAN and the proposed OVFoodSeg in identifying masks of eggs cooked using different methods.
  • Figure 2: The figure depicts the pipeline of Stage I FoodLearner Pre-training. Stage I is dedicated to pre-training the FoodLearner module with image-text pairs pertinent to food so that the visual information closely related to the accompanying text will be extracted to enrich the text representation.
  • Figure 3: This figure depicts the pipeline of Stage II Segmentation Learning, focusing on training the segmenter using image-informed text embeddings. The FoodLearner extracts image-specific information, which is then combined with the text embeddings from the Image-Informed Text Encoder to produce the final image-informed text embeddings. Note that modules with the same name share parameters, i.e., the CLIP image encoder and the CLIP text encoder.
  • Figure 4: Visualization results on FoodSeg103, where GT means ground truth. OVFoodSeg achieves better performance, especially for novel classes.
  • Figure 5: Failure cases of OVFoodSeg on FoodSeg103 (Split 1) where GT means ground-truth. In this example, OVFoodSeg incorrectly classified "white button mushroom", a novel class, as "shiitake", which is a base class.
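
The captions above describe mask classification as scoring each mask against per-class text embeddings. A minimal sketch of that scoring step follows, assuming the logits are cosine similarities between a mask embedding and each class's text embedding. The 2-D vectors are invented toy values, not model outputs, and the function names are hypothetical; how OVFoodSeg fuses image-specific information into the text embeddings is not reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify_mask(mask_embedding, text_embeddings):
    """Return the best-scoring class name and the score for every class."""
    scores = {name: cosine(mask_embedding, emb)
              for name, emb in text_embeddings.items()}
    return max(scores, key=scores.get), scores

# Toy text embeddings for one base class and one novel class (made-up values).
text_embeddings = {
    "shiitake": [1.0, 0.0],
    "white button mushroom": [0.0, 1.0],
}

# A mask embedding lying closer to the "shiitake" direction gets that label,
# which mirrors the kind of confusion shown in the failure case of Figure 5.
label, scores = classify_mask([0.9, 0.1], text_embeddings)
```

In this picture, shifting the text embeddings toward what the image actually shows changes which class wins the comparison, which is the motivation the captions give for image-informed text embeddings.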