Nutrition Estimation for Dietary Management: A Transformer Approach with Depth Sensing

Zhengyi Kwan, Wei Zhang, Zhengkui Wang, Aik Beng Ng, Simon See

TL;DR

This work tackles nutrition estimation for dietary management by introducing NuNet, a depth-aware transformer that fuses RGB and depth information via a dual-branch, multi-scale architecture. The model combines a Swin Transformer-based encoder for RGB and depth, two fusion modules (FL for lightweight fusion and FE for enhanced fusion), and a multi-scale decoder with deep supervision to predict five nutritional factors. Empirical results on Nutrition5k show NuNet achieving a mean MAPE of 15.65%, outperforming RGB-only and RGB+depth baselines and confirming the value of depth data and multi-scale fusion. The study highlights the importance of carefully designed feature fusion and multi-scale processing for robust nutrition estimation, with practical implications for dietary management and cross-domain multi-modal applications.
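
The headline number above is a mean MAPE over the five predicted factors. The summary does not spell out the formula, so the following is only a minimal sketch that assumes the standard definition of mean absolute percentage error, computed per factor and then averaged; the paper's exact metric may differ. The function name `mape` and all dish values are hypothetical and chosen purely for illustration.

```python
import numpy as np

def mape(y_true, y_pred):
    """Per-factor mean absolute percentage error, in percent.

    Assumes the standard MAPE definition; rows are dishes, columns are factors.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs(y_pred - y_true) / np.abs(y_true), axis=0)

# Hypothetical ground truth and predictions: 2 dishes x 5 nutritional factors.
y_true = np.array([[200.0, 100.0, 10.0, 20.0, 10.0],
                   [400.0, 200.0, 20.0, 40.0, 20.0]])
y_pred = np.array([[220.0,  90.0, 11.0, 18.0, 10.0],
                   [360.0, 220.0, 18.0, 44.0, 20.0]])

per_factor = mape(y_true, y_pred)   # one error value per factor
mean_mape = per_factor.mean()       # single score averaged over the factors
print(per_factor, mean_mape)
```

Averaging per-factor percentages keeps factors with large absolute values from dominating the score, which is one plausible reason to report a single mean figure.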

Abstract

Nutrition estimation is crucial for effective dietary management and overall health and well-being. Existing methods often struggle with sub-optimal accuracy and can be time-consuming. In this paper, we propose NuNet, a transformer-based network designed for nutrition estimation that utilizes both RGB and depth information from food images. We have designed and implemented a multi-scale encoder and decoder, along with two types of feature fusion modules, specialized for estimating five nutritional factors. These modules balance the efficiency and effectiveness of feature extraction through flexible use of our customized attention mechanisms and fusion strategies. Our experimental study shows that NuNet significantly outperforms its variants and existing solutions for nutrition estimation. It achieves an error rate of 15.65%, the lowest known to us, largely due to our multi-scale architecture and fusion modules. This research holds practical value for dietary management, with huge potential for transnational research and deployment, and could inspire other applications involving multiple data types with varying degrees of importance.

Paper Structure

This paper contains 36 sections, 11 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: An illustration of the system architecture of NuNet. A smartphone with depth sensing captures both RGB and depth images, which are processed by our NuNet for nutrition estimation. The estimation of key nutritional factors is shared with the users and utilized for enhanced dietary management.
  • Figure 2: Sample food images with both RGB and depth information from the Nutrition5k dataset (Thames et al., 2021).
  • Figure 3: An illustration of the system architecture of NuNet. NuNet consists of three key components, including a multi-scale encoder, a feature fusion module, and a multi-scale decoder. It utilizes both RGB and depth images as input and analyzes the data using its transformer architecture. Finally, NuNet generates the nutrition estimation values for dietary management.
  • Figure 4: An illustration of FL for lightweight feature fusion. FL adds the RGB and depth features from the same encoder scale to generate $\mathbf{f}$. The final FL output is the sum of $\mathbf{f}$ and $\mathbf{f}'$, where $\mathbf{f}'$ is generated from $\mathbf{f}$ by an attention module (W-MSA).
  • Figure 5: An illustration of FE for enhanced feature fusion. FE has three fusion paths: concatenation, multiplication, and addition. Each path uses both RGB and depth features in a different way and introduces two attention modules (W-MSA and SW-MSA) to process the features. The outputs of the three paths are merged by an MLP before the final FE feature is generated.
  • ...and 1 more figure
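
The Figure 4 caption describes FL as an element-wise sum $\mathbf{f}$ of same-scale RGB and depth features, followed by a residual sum of $\mathbf{f}$ with a window-attention output $\mathbf{f}'$. The NumPy sketch below illustrates only that data flow. It is not the authors' implementation: the attention here is a simplified single-head window attention without relative position bias or normalization layers, and the function names, projection matrices, window size, and feature shapes are all assumptions made for illustration.

```python
import numpy as np

def window_partition(x, ws):
    """Split an (H, W, C) map into (num_windows, ws*ws, C) non-overlapping windows."""
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, C)

def window_merge(windows, ws, H, W):
    """Inverse of window_partition: rebuild the (H, W, C) map."""
    C = windows.shape[-1]
    x = windows.reshape(H // ws, W // ws, ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, C)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(x, ws, wq, wk, wv):
    """Simplified stand-in for W-MSA: one head, self-attention inside each window."""
    H, W, C = x.shape
    win = window_partition(x, ws)                    # (nW, ws*ws, C)
    q, k, v = win @ wq, win @ wk, win @ wv
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(C))
    return window_merge(attn @ v, ws, H, W)

def fl_fusion(rgb, depth, ws, wq, wk, wv):
    """FL data flow from the caption: f = rgb + depth, output = f + f'."""
    f = rgb + depth
    f_prime = window_attention(f, ws, wq, wk, wv)
    return f + f_prime

# Hypothetical 8x8 feature maps with 4 channels and a 4x4 attention window.
rng = np.random.default_rng(0)
rgb, depth = rng.normal(size=(8, 8, 4)), rng.normal(size=(8, 8, 4))
wq, wk, wv = (rng.normal(size=(4, 4)) for _ in range(3))
out = fl_fusion(rgb, depth, 4, wq, wk, wv)
print(out.shape)
```

Because the attention branch enters only through a residual sum, the fused feature falls back to the plain RGB-plus-depth sum whenever the attention output is zero, which fits the caption's description of FL as the lightweight option.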