Multimodal ML: Quantifying the Improvement of Calorie Estimation Through Image-Text Pairs
Arya Narang
TL;DR
The paper asks whether brief textual prompts can improve calorie estimation from images beyond an image-only CNN. It trains two CNNs on Nutrition5k: a lightweight image-only model and a multimodal model that takes a short dish name as an additional input, processed through text encoding and cross-modal attention. MAE falls from $84.76$ to $83.70$ kcal, with modest $R^2$ gains. A paired $t$-test with $n=653$ and $\alpha=0.1$ yields $t=0.6339$, $p=0.2623$; since $p > \alpha$, the improvement is not statistically significant, though the multimodal model shows reduced variability in absolute error. The work suggests that brief text can offer stability and potential gains, but that larger, more diverse datasets and richer fusion mechanisms are needed to realize significant, generalizable benefits in practical deployment.
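To make the significance test concrete, here is a minimal sketch of a paired $t$-test on per-dish absolute errors. It is not the paper's code. The function names are hypothetical, and the direction of the difference (baseline minus multimodal) is an assumption. The reported $p$ looks consistent with a one-sided test, which is also an assumption. Because $n=653$ is large, the sketch approximates the $t$ distribution with a standard normal to stay within the standard library.

```python
import math
import statistics


def paired_t_statistic(baseline_errors, multimodal_errors):
    """t statistic for the mean per-dish difference (baseline - multimodal)."""
    if len(baseline_errors) != len(multimodal_errors):
        raise ValueError("paired samples must have equal length")
    diffs = [b - m for b, m in zip(baseline_errors, multimodal_errors)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference, using the sample std (n - 1).
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff / std_err


def one_sided_p_normal(t):
    """Upper-tail p-value under a standard normal approximation.

    Assumption: with several hundred degrees of freedom the t distribution
    is close to normal, so this is an approximation, not an exact t-test.
    """
    return 0.5 * math.erfc(t / math.sqrt(2))


# Plugging in the reported statistic gives a p-value of about 0.26,
# which is above alpha = 0.1, so the null hypothesis is not rejected.
p_value = one_sided_p_normal(0.6339)
print(p_value > 0.1)
```

With real per-dish errors, `paired_t_statistic` would take the two lists of absolute errors from the image-only and multimodal models on the same test dishes.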
Abstract
This paper determines the extent to which short textual inputs (in this case, names of dishes) can improve calorie estimation compared to an image-only baseline model, and whether any improvements are statistically significant. It uses the TensorFlow library and the Nutrition5k dataset (curated by Google) to train both an image-only CNN and a multimodal CNN that accepts both text and an image as input. With the multimodal model, the MAE of the calorie estimates fell by 1.06 kcal, from 84.76 kcal to 83.70 kcal (a 1.25% improvement).
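The reported percentage follows directly from the two MAE values. As a small sketch of that arithmetic (the helper names are hypothetical and not taken from the paper), the MAE and its relative improvement could be computed as follows:

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted calories."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_improvement(baseline_mae, new_mae):
    """Percentage reduction in MAE relative to the baseline."""
    return (baseline_mae - new_mae) / baseline_mae * 100.0


# Reported values: 84.76 kcal (image-only) and 83.70 kcal (multimodal).
print(round(relative_improvement(84.76, 83.70), 2))  # 1.25
```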
