Weakly Supervised Food Image Segmentation using Vision Transformers and Segment Anything Model

Ioannis Sarafis, Alexandros Papadopoulos, Anastasios Delopoulos

TL;DR

The paper tackles weakly supervised semantic segmentation of food images by combining a ViT-based classifier with SAM: Grad-CAM activation maps from the classifier supply the point prompts for SAM, enabling class-aware segmentation using only image-level labels. A Swin Transformer is fine-tuned for multi-label food classification, Grad-CAM locates the highest-activation point for each predicted class, and SAM generates either a single mask or multiple candidate masks per class, with Gaussian smoothing of the input explored as a way to improve mask coherence. Evaluated on FoodSeg103, the approach reaches a mean IoU of up to 0.54 in the multi-mask setting, suggesting it could accelerate food annotation and support semi-automatic workflows in nutrition tracking. The study also examines the trade-offs between input preprocessing and mask strategies, and discusses practical limitations along with directions such as scaling to larger datasets and exploring alternative prompting strategies.

Abstract

In this paper, we propose a weakly supervised semantic segmentation approach for food images which takes advantage of the zero-shot capabilities and promptability of the Segment Anything Model (SAM) along with the attention mechanisms of Vision Transformers (ViTs). Specifically, we use class activation maps (CAMs) from ViTs to generate prompts for SAM, resulting in masks suitable for food image segmentation. The ViT model, a Swin Transformer, is trained exclusively using image-level annotations, eliminating the need for pixel-level annotations during training. Additionally, to enhance the quality of the SAM-generated masks, we examine the use of image preprocessing techniques in combination with single-mask and multi-mask SAM generation strategies. The methodology is evaluated on the FoodSeg103 dataset, generating an average of 2.4 masks per image (excluding background), and achieving an mIoU of 0.54 for the multi-mask scenario. We envision the proposed approach as a tool to accelerate food image annotation tasks or as an integrated component in food and nutrition tracking applications.
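
To make the evaluation terms concrete, the following is a minimal sketch of how IoU, a multi-mask "best of several candidates" score, and a mean IoU could be computed. It is an illustration under assumptions, not the authors' evaluation code: masks are plain nested lists of 0/1 values, the function names are hypothetical, the best candidate is chosen by comparison with the ground truth (an oracle choice that stands in for a human picking a mask in a semi-automatic workflow), and the mean is a plain average of per-mask scores, which may differ from the aggregation used in the paper.

```python
def mask_iou(pred, truth):
    """Intersection over union of two same-sized binary masks (nested lists)."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks have no union; report 0.0 rather than dividing by zero.
    return inter / union if union else 0.0


def best_of_candidates(candidates, truth):
    """Highest IoU among several candidate masks (assumed oracle selection)."""
    return max(mask_iou(candidate, truth) for candidate in candidates)


def mean_iou(scores):
    """Plain average of per-mask IoU scores (assumed aggregation)."""
    return sum(scores) / len(scores)


# Toy ground truth and three toy candidate masks for one class.
truth = [[0, 1, 1],
         [0, 1, 1],
         [0, 0, 0]]
candidates = [
    [[0, 1, 0], [0, 0, 0], [0, 0, 0]],  # too small
    [[0, 1, 1], [0, 1, 0], [0, 0, 0]],  # close to the ground truth
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],  # covers the whole image
]

single = mask_iou(candidates[0], truth)        # one returned mask
multi = best_of_candidates(candidates, truth)  # best of the three candidates
```

In this toy case the single returned mask scores lower than the best of the three candidates, which is the kind of gap the multi-mask strategy is meant to exploit.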

Paper Structure

This paper contains 12 sections, 2 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: Overview of the proposed methodology using an example image for illustration. For clarity in the diagram, similar or repeated paths are omitted and represented using dashed arrows. First, the input image is passed through a Swin Transformer model that has been fine-tuned for image-level food classification. Then, for each predicted class, we apply the Grad-CAM algorithm to compute the corresponding class activation map (CAM). The point with the highest activation value in each CAM is selected as a prompt for the Segment Anything Model (SAM). Either the original or a smoothed version of the image can be used as input to SAM, depending on the prompt strategy, with each option having certain trade-offs. Finally, SAM can be used to produce either a single mask or multiple masks, with the latter yielding significant performance gains in semi-automatic usage scenarios.
  • Figure 2: Examples of segmentation masks generated using the proposed method for various food classes within images of the test set. From left to right, the columns show: (1) the original input image, (2) the ground truth segmentation mask for the class, (3) the computed class activation map (CAM), with a cross marking the point of highest activation used as the prompt, (4) the segmentation mask produced by SAM using the single-mask strategy with the original image, (5) the corresponding single-mask output using a smoothed version of the image, (6) the best-performing mask among the top-3 SAM outputs (multi-mask strategy) using the original image, and (7) the corresponding best mask when using the multi-mask strategy with the smoothed image.
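
The prompt-selection step described in the captions above (take the point of highest activation in a CAM and use it as the point prompt) can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function name `cam_peak_prompt` is hypothetical, the CAM is a plain nested list, and the sketch assumes the CAM may be coarser than the image, so the peak cell is mapped to the center of the matching image region. The result is returned in (x, y) order, which is the convention for point prompts in the reference SAM predictor.

```python
def cam_peak_prompt(cam, image_width, image_height):
    """Map the highest-activation CAM cell to an (x, y) image pixel."""
    cam_height, cam_width = len(cam), len(cam[0])
    peak_value, peak_x, peak_y = cam[0][0], 0, 0
    for y, row in enumerate(cam):
        for x, value in enumerate(row):
            if value > peak_value:  # ties keep the first peak encountered
                peak_value, peak_x, peak_y = value, x, y
    # Assumption: use the center of the CAM cell, rescaled to image coordinates.
    x_img = int((peak_x + 0.5) * image_width / cam_width)
    y_img = int((peak_y + 0.5) * image_height / cam_height)
    return x_img, y_img


# Toy 3x3 CAM for one predicted class, with the peak in the middle cell.
cam = [[0.1, 0.2, 0.1],
       [0.3, 0.9, 0.4],
       [0.0, 0.2, 0.1]]
point = cam_peak_prompt(cam, image_width=300, image_height=300)
```

If the CAM has already been upsampled to the image size, the scaling factors are 1 and the function simply returns the peak pixel itself.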