NurtureNet: A Multi-task Video-based Approach for Newborn Anthropometry
Yash Khandelwal, Mayur Arvind, Sriram Kumar, Ashish Gupta, Sachin Kumar Danisetty, Piyush Bagad, Anish Madan, Mayank Lunayach, Aditya Annavajjala, Abhishek Maiti, Sansiddh Jain, Aman Dalmia, Namrata Deka, Jerome White, Jigar Doshi, Angjoo Kanazawa, Rahul Panicker, Alpan Raval, Srinivas Rana, Makarand Tapaswi
TL;DR
The paper tackles the public health challenge of malnutrition screening in newborns by enabling contactless anthropometry in rural low- and middle-income country (LMIC) settings. It introduces NurtureNet, a video-based, multi-task regression framework that ingests RGB video from a low-cost smartphone and augments visual features with birth weight and age to predict weight, length, head circumference, and chest circumference. The weight head, for example, is formalized as $w = \mathrm{MLP}_w([z, w^0, a])$, where $z$ is the visual feature, $w^0$ the birth weight, and $a$ the age. It leverages proxy vision tasks (segmentation and keypoint prediction) via pseudo-labels to improve the learned representation, achieving a weight MAE of $114.3$ g and a relative error of $3.9\%$. After compression to about $15$ MB, the model can be deployed offline on low-cost smartphones. Extensive rural-field experiments (12,901 videos) show robustness to noisy tabular inputs and substantial improvements over conventional practices (e.g., MAE of $183$ g for spring-balance readings). The approach offers a scalable, geo-tagged, contactless solution to monitor newborn growth and inform timely interventions, with potential for large-scale impact in public health programs.
Abstract
Malnutrition among newborns is a top public health concern in developing countries. Identification and subsequent growth monitoring are key to successful interventions. However, this is challenging in rural communities where health systems tend to be inaccessible and under-equipped, with poor adherence to protocol. Our goal is to equip health workers and public health systems with a solution for contactless newborn anthropometry in the community. We propose NurtureNet, a multi-task model that fuses visual information (a video taken with a low-cost smartphone) with tabular inputs to regress multiple anthropometry estimates including weight, length, head circumference, and chest circumference. We show that visual proxy tasks of segmentation and keypoint prediction further improve performance. We establish the efficacy of the model through several experiments and achieve a relative error of 3.9% and mean absolute error of 114.3 g for weight estimation. Model compression to 15 MB also allows offline deployment to low-cost smartphones.
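To make the fusion of visual and tabular inputs concrete, here is a minimal NumPy sketch of a single regression head together with the two reported weight metrics. It is an illustration under stated assumptions, not the authors' implementation: the feature dimension, hidden size, two-layer ReLU MLP, random weights, age in days, and the names `init_head`, `mlp_head`, `predict_weight`, `mae`, and `relative_error` are all hypothetical. The relative error is assumed here to be the mean absolute error divided by the true value, expressed as a percentage.

```python
import numpy as np

def init_head(in_dim, hidden, rng):
    """Randomly initialise a two-layer MLP (sizes are illustrative assumptions)."""
    w1 = rng.normal(0.0, 0.1, (in_dim, hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(0.0, 0.1, (hidden, 1))
    b2 = np.zeros(1)
    return w1, b1, w2, b2

def mlp_head(x, params):
    """Apply the MLP with a ReLU hidden layer; returns one scalar per row of x."""
    w1, b1, w2, b2 = params
    h = np.maximum(x @ w1 + b1, 0.0)
    return (h @ w2 + b2).squeeze(-1)

def predict_weight(z, birth_weight, age, params):
    """Concatenate the visual feature with the tabular inputs, then regress."""
    x = np.concatenate([z, birth_weight[:, None], age[:, None]], axis=1)
    return mlp_head(x, params)

def mae(pred, target):
    """Mean absolute error, in the units of the target (grams for weight)."""
    return float(np.mean(np.abs(pred - target)))

def relative_error(pred, target):
    """Mean absolute error relative to the true value, in percent (assumed definition)."""
    return float(np.mean(np.abs(pred - target) / target) * 100.0)

# Toy batch: 4 videos, each summarised by an 8-dimensional visual feature (made-up data).
rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))
birth_weight = np.array([2.9, 3.1, 2.5, 3.4])   # kg, hypothetical values
age = np.array([3.0, 10.0, 21.0, 35.0])         # days, hypothetical values
params = init_head(in_dim=8 + 2, hidden=16, rng=rng)
pred = predict_weight(z, birth_weight, age, params)  # untrained, so values are meaningless
```

In a multi-task setup, one such head per measurement (weight, length, head circumference, chest circumference) would share the same visual feature, with only the head parameters differing.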
