NutriTransform: Estimating Nutritional Information From Online Food Posts

Thorsten Ruprechter, Marion Garaus, Ivo Ponocny, Denis Helic

TL;DR

NutriTransform tackles the problem of estimating macro-nutrient content from short online post titles, where explicit nutrition data is unavailable. It combines a public USDA food database with SentenceTransformer embeddings to map titles to semantically similar foods and aggregates their nutrition, tuned on a labeled recipe dataset. The approach achieves competitive RMSE relative to an API-based baseline and is applied to over 500k Reddit r/food posts to uncover longitudinal dietary trends. The work provides a practical, scalable tool for nutrition inference from text and opens avenues for computational social science and health research using minimal textual data.

Abstract

Deriving nutritional information from online food posts is challenging, particularly when users do not explicitly log the macro-nutrients of a shared meal. In this work, we present an efficient and straightforward approach to approximating macro-nutrients based solely on the titles of food posts. Our method combines a public food database from the U.S. Department of Agriculture with advanced text embedding techniques. We evaluate the approach on a labeled food dataset, demonstrating its effectiveness, and apply it to over 500,000 real-world posts from Reddit's popular /r/food subreddit to uncover trends in food-sharing behavior based on the estimated macro-nutrient content. Altogether, this work lays a foundation for researchers and practitioners aiming to estimate caloric and nutritional content using only text data.

Paper Structure

This paper contains 4 sections and 3 figures.

Figures (3)

  • Figure 1: Pipeline for estimating nutritional values. We first generate SentenceTransformer embeddings for all foods in the USDA food database. Afterwards, when labeling a new food post, such as a submission to /r/food on Reddit, we embed the title of the post and retrieve the $n$ most similar entries that exceed a predefined similarity threshold $t$. Finally, we aggregate the nutritional values of these most-similar items to estimate the macro-nutrients (e.g., calories) for the unlabeled food titles.
  • Figure 2: Food posts on Reddit. The /r/food subreddit is one of the largest sub-communities on the popular online discussion platform Reddit. Users share food-related posts within this community, receiving engagement in the form of upvotes (i.e., likes) or comments (Fig. \ref{fig:food_sub}). Since 2017, /r/food has consistently received between $1\,300$ and $2\,300$ posts per week, with activity peaking at over $4\,000$ weekly posts following the onset of the COVID-19 pandemic in March 2020 (Fig. \ref{fig:food_post_counts}). Similarly, the early phases of the pandemic saw the highest number of unique weekly contributors, with over $1\,400$ individual users posting each week (Fig. \ref{fig:food_author_counts}).
  • Figure 3: Nutritional values of foods shared on Reddit over the years. We visualize weekly medians of food posts on r/food for four nutritional metrics: calories, protein, fat, and carbohydrates per 100 grams. Our investigation reveals general trends, such as higher calorie counts toward the end of most years, as well as a notable plateau from March to June 2020, coinciding with the onset of the COVID-19 pandemic.
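
The retrieval-and-aggregation step described in Figure 1 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `estimate_nutrients`, the toy two-dimensional vectors, the default values for $n$ and $t$, the use of cosine similarity, and the choice of a plain mean as the aggregation are all hypothetical. In a real setting, the vectors would be produced by a SentenceTransformer model applied to the USDA food descriptions and to the post title.

```python
import numpy as np


def estimate_nutrients(title_vec, food_vecs, food_nutrients, n=5, t=0.5):
    """Estimate nutrients for one title from its most similar database foods.

    title_vec:      embedding of the post title, shape (d,)
    food_vecs:      embeddings of the database foods, shape (m, d)
    food_nutrients: nutrient values per food, shape (m, k)
    n, t:           number of neighbours and similarity threshold
                    (the default values here are assumptions)
    Returns the mean nutrient vector, or None if no food exceeds t.
    """
    # Normalise so that dot products equal cosine similarities.
    food_unit = food_vecs / np.linalg.norm(food_vecs, axis=1, keepdims=True)
    title_unit = title_vec / np.linalg.norm(title_vec)
    sims = food_unit @ title_unit

    # Take the n most similar foods, then keep only those above the threshold.
    top = np.argsort(-sims)[:n]
    keep = top[sims[top] > t]
    if keep.size == 0:
        return None

    # Aggregate by averaging (an assumed aggregation, chosen for simplicity).
    return food_nutrients[keep].mean(axis=0)


# Toy stand-ins for real embeddings and for USDA nutrient rows.
food_vecs = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
food_nutrients = np.array([
    [250.0, 10.0],  # hypothetical food A: kcal, protein (g) per 100 g
    [350.0, 20.0],  # hypothetical food B
    [50.0, 1.0],    # hypothetical food C
])
title_vec = np.array([1.0, 0.0])

estimate = estimate_nutrients(title_vec, food_vecs, food_nutrients, n=2, t=0.5)
print(estimate)
```

With these toy inputs, the two nearest foods (A and B) both exceed the threshold, so the estimate is their average: about 300 kcal and 15 g of protein per 100 g. Returning `None` when nothing exceeds $t$ keeps unmatched titles out of downstream statistics rather than assigning them an arbitrary value.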