Knowledge distillation to effectively attain both region-of-interest and global semantics from an image where multiple objects appear

Seonwhee Jin

TL;DR

The paper tackles accurate localization and fine-grained classification of foods in complex images where multiple objects appear. It uses the Segment-Anything Model (SAM) to isolate the ROI and recasts detection as ROI-focused classification. The proposed RveRNet architecture jointly processes ROI and global context via ROI and extra-ROI modules, with transformer-based DeiT configurations achieving the strongest performance and revealing a trade-off in CNN-to-DeiT knowledge distillation under noise. Experiments on FoodSeg103 and ketchup/chili paste scenarios show improved F1 on ambiguous cases and robustness to input perturbations, and the authors provide public code.

Abstract

Models based on convolutional neural networks (CNNs) and transformers have steadily improved and have been applied to a variety of downstream computer vision tasks. However, in object detection, accurately localizing and classifying the almost unlimited categories of food in images remains challenging. To address this problem, we first segmented the food as the region of interest (ROI) using the Segment-Anything Model (SAM) and masked everything outside the ROI with black pixels. This step reduced the problem to a single classification task, for which annotation and training are much simpler than for object detection. The ROI-only images were used as inputs to fine-tune various off-the-shelf models, each encoding its own inductive biases. Among them, Data-efficient image Transformers (DeiTs) achieved the best classification performance. Nonetheless, when foods had similar shapes and textures, the contextual features in the ROI-only images were not sufficient for accurate classification. We therefore introduced a novel combined architecture, RveRNet, consisting of ROI, extra-ROI, and integration modules that allow it to account for both the ROI and the global context. RveRNet's F1 score was 10% better than that of the individual models when classifying ambiguous food images. RveRNet performed best when its modules were DeiTs with knowledge distilled from a CNN. We also investigated how architectures can be made robust against input noise caused by permutation and translocation. The results indicated a trade-off between how much of the CNN teacher's knowledge could be distilled into DeiT and DeiT's innate strength. Code is publicly available at: https://github.com/Seonwhee-Genome/RveRNet.
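The masking step described in the abstract can be illustrated with a minimal NumPy sketch. It assumes a boolean ROI mask is already available (for example, from a segmenter such as SAM) and produces both an ROI-only image and its complement with the ROI blacked out. The function name `make_cutout_pair` and the toy data are hypothetical and are not taken from the authors' repository.

```python
import numpy as np

def make_cutout_pair(image, roi_mask):
    """Split an image into ROI-only and extra-ROI views using a binary mask.

    image:    uint8 array of shape (H, W, 3)
    roi_mask: boolean array of shape (H, W), True inside the ROI
              (assumed to be produced upstream by a segmenter such as SAM)
    """
    if image.shape[:2] != roi_mask.shape:
        raise ValueError("mask and image must share height and width")
    mask3 = roi_mask[..., None]             # add a channel axis for broadcasting
    roi_only = np.where(mask3, image, 0)    # pixels outside the ROI become black
    extra_roi = np.where(mask3, 0, image)   # pixels inside the ROI become black
    return roi_only.astype(image.dtype), extra_roi.astype(image.dtype)

# Toy example: a uniform 4x4 image whose central 2x2 block is the ROI.
image = np.full((4, 4, 3), 200, dtype=np.uint8)
roi_mask = np.zeros((4, 4), dtype=bool)
roi_mask[1:3, 1:3] = True
roi_only, extra_roi = make_cutout_pair(image, roi_mask)
```

Because the two views are complementary, adding them pixel-wise recovers the original image, which is a convenient sanity check on the mask handling.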

Paper Structure

This paper contains 20 sections, 8 equations, 7 figures, and 12 tables.

Figures (7)

  • Figure 1: Natural food images. (a) Multiple kinds of foods on one plate. Annotated segmentation masks are shown. The original image is from the FoodSeg103 dataset. (b) Without any contextual clues, it is almost impossible to tell which one is water and which is liquor. Photo by Seonwhee Jin, 2024.
  • Figure 2: The structure of the proposed RveRNet. We used the robust SAM foundation model to segment the ROI of input images. Then, we processed the images to produce complementary cut-out pairs that were used as inputs for the ROI and extra-ROI modules. The ROI and extra-ROI modules can have different architectures that encode different inductive biases.
  • Figure 3: Examples of classification by various off-the-shelf models. (a) A case in which transformer-based models succeeded and the CNN failed. (b) A case in which the CNN model succeeded and the transformer-based models failed. In all figures and tables, we abbreviated “ground truth” as “GT” and “prediction” as “pred.”
  • Figure 4: Grad-CAM visualization of chili paste images. (a) Original ROI image. (b) Prediction by an individual MobileNetV2 network. (c) Prediction by RveRNet with both modules being MobileNetV2.
  • Figure 5: Cases where individual off-the-shelf models failed but RveRNet succeeded. The parallel structure of RveRNet enables it to take an image’s global context into account to more accurately classify ambiguous foods like ketchup and chili paste.
  • ...and 2 more figures
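
The parallel structure described in the Figure 2 and Figure 5 captions can be sketched as a toy forward pass. This is only an illustration under stated assumptions: the two modules are stand-in feature extractors (global average pooling followed by a random linear map) rather than real CNN or DeiT backbones, and the integration step is assumed to be feature concatenation followed by a linear classifier, which the captions do not specify. All names (`stand_in_backbone`, `two_branch_forward`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def stand_in_backbone(weights):
    """Return a toy feature extractor: global average pooling + linear map.

    Placeholder only; a real ROI or extra-ROI module would be a CNN or DeiT.
    """
    def extract(image):
        pooled = image.reshape(-1, image.shape[-1]).mean(axis=0)  # shape (3,)
        return np.tanh(weights @ pooled)                          # shape (feat_dim,)
    return extract

def softmax(logits):
    shifted = logits - logits.max()   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def two_branch_forward(roi_image, extra_roi_image, roi_module, extra_module, head):
    """Fuse ROI and extra-ROI features and classify.

    Fusion by concatenation is an assumption made for this sketch.
    """
    fused = np.concatenate([roi_module(roi_image), extra_module(extra_roi_image)])
    return softmax(head @ fused)

feat_dim, num_classes = 8, 2
# The two modules have independent weights and could use different architectures.
roi_module = stand_in_backbone(rng.normal(size=(feat_dim, 3)))
extra_module = stand_in_backbone(rng.normal(size=(feat_dim, 3)))
head = rng.normal(size=(num_classes, 2 * feat_dim))

roi_image = rng.random((16, 16, 3))        # stands in for the ROI-only cut-out
extra_roi_image = rng.random((16, 16, 3))  # stands in for the complementary cut-out
probs = two_branch_forward(roi_image, extra_roi_image, roi_module, extra_module, head)
```

Keeping the two extractors as separate callables mirrors the caption's point that the ROI and extra-ROI modules may encode different inductive biases; swapping one of them does not affect the other branch.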