Long-Tailed Continual Learning For Visual Food Recognition

Jiangpeng He, Xiaoyan Zhang, Luotao Lin, Jack Ma, Heather A. Eicher-Miller, Fengqing Zhu

TL;DR

This work tackles long-tailed continual learning for visual food recognition by introducing a unified end-to-end framework that combines feature-based knowledge distillation with a learnable prediction head and CAM-guided CutMix augmentation. It also contributes a new 186-item VFN186 dataset and three population-specific long-tailed benchmarks (VFN186-LT, VFN186-INSULIN, VFN186-T2D) to reflect real-world dietary patterns. Empirical results demonstrate significant gains over existing continual learning methods across multiple LT food datasets, with analyses highlighting the contributions of each component and the practicality of the approach in terms of training efficiency and memory usage. The findings have direct implications for deploying robust, privacy-aware, real-world food recognition systems in diverse populations, guiding future work toward exemplar-free and scalable frameworks.

Abstract

Deep learning-based food recognition has made significant progress in predicting food types from eating occasion images. However, two key challenges hinder real-world deployment: (1) continuously learning new food classes without forgetting previously learned ones, and (2) handling the long-tailed distribution of food images, in which a few classes are common and many more are rare. To address these, food recognition methods should focus on long-tailed continual learning. In this work, we introduce a dataset that encompasses 186 American foods along with comprehensive annotations. We also introduce three new benchmark datasets, VFN186-LT, VFN186-INSULIN, and VFN186-T2D, which reflect real-world food consumption for healthy populations, insulin takers, and individuals with type 2 diabetes who do not take insulin. We propose a novel end-to-end framework that improves the generalization ability for instance-rare food classes using a knowledge distillation-based predictor to avoid misalignment of representation during continual learning. Additionally, we introduce an augmentation technique that integrates class activation maps (CAM) with CutMix to improve generalization on instance-rare food classes. Our method, evaluated on Food101-LT, VFN-LT, VFN186-LT, VFN186-INSULIN, and VFN186-T2D, shows significant improvements over existing methods. An ablation study quantifies the contribution of each component, supporting the method's potential for real-world food recognition applications.
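
The knowledge distillation-based predictor mentioned above can be illustrated with a minimal sketch. The abstract does not give the formulation, so this assumes that the predictor g is a linear map applied to the new model's feature and that the loss is the mean squared error against the frozen old model's feature. The names `apply_predictor` and `feature_distillation_loss` are hypothetical and are not taken from the paper.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def apply_predictor(weight: Matrix, bias: Vector, feature: Vector) -> Vector:
    """Hypothetical linear predictor g(v) = W v + b (assumed form of g)."""
    return [
        sum(w * x for w, x in zip(row, feature)) + b
        for row, b in zip(weight, bias)
    ]


def feature_distillation_loss(
    new_feature: Vector, old_feature: Vector, weight: Matrix, bias: Vector
) -> float:
    """Mean squared error between g(new feature) and the old model's feature.

    Assumption: the old feature comes from a frozen copy of the model saved
    before the current task, and g is trained together with the new model.
    """
    predicted = apply_predictor(weight, bias, new_feature)
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, old_feature)]
    return sum(squared_errors) / len(squared_errors)


# With an identity predictor, the loss reduces to plain feature matching.
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_bias = [0.0, 0.0]
loss = feature_distillation_loss([1.0, 2.0], [1.0, 4.0], identity, zero_bias)
```

Because g is learnable, the new feature space is not forced to coincide with the old one; only the mapped feature has to match, which is one plausible reading of how a predictor could reduce representation misalignment.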

Paper Structure (32 sections, 10 equations, 7 figures, 7 tables)

This paper contains 32 sections, 10 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: The distribution of VFN186-LT, VFN186-INSULIN, and VFN186-T2D shown in descending order based on the number of training samples.
  • Figure 2: The overview of our proposed framework. The red arrows show the training process with new class images and exemplars from previous classes. The blue arrows denote the steps after the training process where we construct a balanced exemplar set and store them in the memory buffer.
  • Figure 3: Overview of the proposed feature-based knowledge distillation, which applies an additional predictor $g$.
  • Figure 4: Overview of the proposed CAM-based data augmentation technique. The green arrow describes the selection of the most visually similar candidate image, and the red arrow illustrates the steps to obtain the most important region of the input image to perform CutMix (Yun et al., 2019).
  • Figure 5: Results on Food101-LT, VFN-LT, VFN186-LT, VFN186-INSULIN, and VFN186-T2D with different numbers of tasks $N$. Each marker represents the Top-1 classification accuracy evaluated on all classes seen so far after learning each task.
  • ...and 2 more figures
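
The CAM-based augmentation described in the Figure 4 caption can be sketched roughly as follows. This is a simplified illustration under stated assumptions: images and the class activation map are plain 2D grids of the same size, the "most important region" is taken to be the fixed-size box with the largest summed activation, and that box is pasted onto the candidate image at the same location. Candidate selection by visual similarity is omitted, and the function names `most_important_box` and `cam_cutmix` are hypothetical.

```python
from typing import List, Tuple

Grid = List[List[float]]


def most_important_box(cam: Grid, box_h: int, box_w: int) -> Tuple[int, int]:
    """Return (top, left) of the box_h x box_w window with the largest CAM sum."""
    rows, cols = len(cam), len(cam[0])
    best_score, best_pos = float("-inf"), (0, 0)
    for top in range(rows - box_h + 1):
        for left in range(cols - box_w + 1):
            score = sum(
                cam[r][c]
                for r in range(top, top + box_h)
                for c in range(left, left + box_w)
            )
            if score > best_score:
                best_score, best_pos = score, (top, left)
    return best_pos


def cam_cutmix(
    image: Grid, cam: Grid, candidate: Grid, box_h: int, box_w: int
) -> Tuple[Grid, float]:
    """Paste the most activated box of `image` onto a copy of `candidate`.

    Returns the mixed image and the fraction of pixels taken from `image`,
    which standard CutMix uses as the label-mixing weight.
    """
    top, left = most_important_box(cam, box_h, box_w)
    mixed = [row[:] for row in candidate]  # copy so the candidate is untouched
    for r in range(top, top + box_h):
        for c in range(left, left + box_w):
            mixed[r][c] = image[r][c]
    pasted_fraction = (box_h * box_w) / (len(image) * len(image[0]))
    return mixed, pasted_fraction


# Toy example: the CAM peaks in rows 1-2, columns 2-3 of a 4x4 image.
image = [[1.0] * 4 for _ in range(4)]
candidate = [[0.0] * 4 for _ in range(4)]
cam = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 5.0, 5.0],
    [0.0, 0.0, 5.0, 5.0],
    [0.0, 0.0, 0.0, 0.0],
]
mixed, pasted_fraction = cam_cutmix(image, cam, candidate, box_h=2, box_w=2)
```

Compared with standard CutMix, which samples the box location at random, guiding the box with the CAM keeps the most discriminative part of the input image in the mixed sample.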