Multimodal ML: Quantifying the Improvement of Calorie Estimation Through Image-Text Pairs

Arya Narang

TL;DR

The paper asks whether brief textual prompts can improve calorie estimation from images beyond what an image-only CNN achieves. It trains two CNNs on Nutrition5k: a lightweight image-only model, and a multimodal model that takes a short dish-name input through text encoding and cross-modal attention. MAE falls from $84.76$ to $83.70$ kcal, with modest $R^2$ gains. A paired $t$-test with $n=653$ and $\alpha=0.1$ yields $t=0.6339$, $p=0.2623$, so the improvement is not statistically significant, though the multimodal model shows reduced absolute-error variability. The work suggests that brief text can offer stability and potential gains, but that larger and more diverse datasets with richer fusion mechanisms are needed to realize significant, generalizable benefits for practical deployment.
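The paired $t$-test mentioned above can be illustrated with a minimal sketch. It assumes the test is run on per-dish absolute errors of the two models, which the summary does not state explicitly. The function names, the toy error values, and the normal approximation to the $t$ distribution (reasonable only for large $n$) are all assumptions made for illustration, not the paper's code.

```python
import math
import statistics


def paired_t_statistic(errors_a, errors_b):
    """Return (t, degrees_of_freedom) for paired samples.

    A positive t means errors_a is larger than errors_b on average,
    i.e. model B has the lower error. Hypothetical helper, not from the paper.
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences have zero variance")
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


def one_sided_p_normal_approx(t):
    """Approximate one-sided p-value P(T > t) using a standard normal.

    Assumption: the sample is large enough that the t distribution
    is close to normal (plausible for n in the hundreds).
    """
    return 1.0 - statistics.NormalDist().cdf(t)


# Toy absolute errors in kcal (made up for illustration only).
image_only_abs_err = [90.0, 70.0, 110.0, 60.0]
multimodal_abs_err = [85.0, 72.0, 100.0, 61.0]

t_stat, dof = paired_t_statistic(image_only_abs_err, multimodal_abs_err)
print(f"t = {t_stat:.4f}, df = {dof}")
```

Under this approximation, passing the reported $t=0.6339$ to `one_sided_p_normal_approx` gives a value of roughly $0.26$, which is in line with the reported $p=0.2623$ and well above $\alpha=0.1$. That suggests, without confirming, that the reported test was one-sided.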

Abstract

This paper examines the extent to which short textual inputs (here, dish names) can improve calorie estimation over an image-only baseline model, and whether any improvement is statistically significant. It uses the TensorFlow library and the Nutrition5k dataset (curated by Google) to train both an image-only CNN and a multimodal CNN that accepts both text and an image as input. With the multimodal model, the MAE of the calorie estimates fell by 1.06 kcal, from 84.76 kcal to 83.70 kcal (a 1.25% improvement).
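As a quick check of the reported figures, the relative improvement follows from the two MAE values, where the symbol $\Delta_{\text{MAE}}$ is introduced here only for illustration:

$$
\Delta_{\text{MAE}} = 84.76 - 83.70 = 1.06\ \text{kcal}, \qquad \frac{\Delta_{\text{MAE}}}{84.76} = \frac{1.06}{84.76} \approx 0.0125 = 1.25\%.
$$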

Paper Structure

This paper contains 23 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Overview of proposed image-only CNN model. Consists of 2 dense linear layers to help the model understand the features of dishes.
  • Figure 2: Unimodal performance illustration
  • Figure 3: Multimodal performance illustration
  • Figure 4: Overview of proposed multimodal CNN model. Consists of an image and text branch and a multi-head attention layer which fuses the 2 inputs to develop an advanced understanding of the relationship between them.