Nutrition Estimation for Dietary Management: A Transformer Approach with Depth Sensing
Zhengyi Kwan, Wei Zhang, Zhengkui Wang, Aik Beng Ng, Simon See
TL;DR
This work tackles nutrition estimation for dietary management by introducing NuNet, a depth-aware transformer that fuses RGB and depth information via a dual-branch, multi-scale architecture. The model combines a Swin Transformer-based encoder for RGB and depth, two fusion modules (FL for lightweight fusion and FE for enhanced fusion), and a multi-scale decoder with deep supervision to predict five nutritional factors. Empirical results on Nutrition5k show NuNet achieving a mean MAPE of 15.65%, outperforming RGB-only and RGB+depth baselines and confirming the value of depth data and multi-scale fusion. The study highlights the importance of carefully designed feature fusion and multi-scale processing for robust nutrition estimation, with practical implications for dietary management and cross-domain multi-modal applications.
Abstract
Nutrition estimation is crucial for effective dietary management and for overall health and well-being. Existing methods often suffer from sub-optimal accuracy and can be time-consuming. In this paper, we propose NuNet, a transformer-based network for nutrition estimation that uses both RGB and depth information from food images. We design and implement a multi-scale encoder and decoder, along with two types of feature fusion modules, specialized for estimating five nutritional factors. These modules balance the efficiency and effectiveness of feature extraction through flexible use of our customized attention mechanisms and fusion strategies. Our experimental study shows that NuNet significantly outperforms both its own variants and existing solutions for nutrition estimation. It achieves an error rate of 15.65%, the lowest known to us, largely due to our multi-scale architecture and fusion modules. This research holds practical value for dietary management, with huge potential for transnational research and deployment, and could inspire other applications involving multiple data types with varying degrees of importance.
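To make the reported error figure concrete, the following is a minimal sketch of how a mean MAPE over several nutritional factors could be computed. It assumes the standard definition of mean absolute percentage error and an unweighted average across factors; the exact evaluation protocol is not spelled out in this section, so the function names (`mape`, `mean_mape`), the factor names, and the sample values are all illustrative assumptions, not the authors' code or data.

```python
from typing import Dict, Sequence


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute percentage error (in percent) for one nutritional factor.

    Assumption: samples whose ground-truth value is zero are skipped,
    because the percentage error is undefined for them.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != 0]
    if not pairs:
        raise ValueError("all ground-truth values are zero")
    return 100.0 * sum(abs(p - t) / abs(t) for t, p in pairs) / len(pairs)


def mean_mape(truth: Dict[str, Sequence[float]],
              pred: Dict[str, Sequence[float]]) -> float:
    """Unweighted average of the per-factor MAPE values (assumed aggregation)."""
    per_factor = [mape(truth[name], pred[name]) for name in truth]
    return sum(per_factor) / len(per_factor)


if __name__ == "__main__":
    # Hypothetical ground truth and predictions for two dishes.
    # The factor names below are assumptions chosen for illustration.
    truth = {
        "calories": [100.0, 200.0],
        "mass": [50.0, 80.0],
        "fat": [10.0, 20.0],
        "carbs": [30.0, 60.0],
        "protein": [5.0, 10.0],
    }
    pred = {
        "calories": [110.0, 180.0],
        "mass": [55.0, 72.0],
        "fat": [11.0, 18.0],
        "carbs": [33.0, 54.0],
        "protein": [5.5, 9.0],
    }
    print(f"mean MAPE: {mean_mape(truth, pred):.2f}%")
```

Under this definition, a lower value means predictions are proportionally closer to the ground truth, and every factor contributes equally to the final score regardless of its unit or scale.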
