When LLMs Can't Help: Real-World Evaluation of LLMs in Nutrition

Karen Jia-Hui Li, Simone Balloccu, Ondrej Dusek, Ehud Reiter

TL;DR

This study tackles the gap between intrinsic evaluations of LLMs and their real-world impact in nutrition by conducting the first extrinsic randomized controlled trial of an LLM-enhanced nutrition chatbot. The system augments a rule-based diet coach with an LLM-based rephrasing module and a fine-tuned nutritional counselling component, tested over seven weeks with 81 completers across three arms. Results show no consistent improvements in dietary adherence, emotional well-being, or engagement, despite favorable intrinsic evaluations of rephrasing and counselling. The findings underscore the need for rigorous real-world validation and human-centered design when applying LLMs to sensitive health domains, and they highlight the gaps between benchmark performance and meaningful outcomes in health interventions.

Abstract

The increasing trust in large language models (LLMs), especially in the form of chatbots, is often undermined by the lack of their extrinsic evaluation. This holds particularly true in nutrition, where randomised controlled trials (RCTs) are the gold standard, and experts demand them for evidence-based deployment. LLMs have shown promising results in this field, but these are limited to intrinsic setups. We address this gap by running the first RCT involving LLMs for nutrition. We augment a rule-based chatbot with two LLM-based features: (1) message rephrasing for conversational variety and engagement, and (2) nutritional counselling through a fine-tuned model. In our seven-week RCT (n=81), we compare chatbot variants with and without LLM integration. We measure effects on dietary outcome, emotional well-being, and engagement. Despite our LLM-based features performing well in intrinsic evaluation, we find that they did not yield consistent benefits in real-world deployment. These results highlight critical gaps between intrinsic evaluations and real-world impact, emphasising the need for interdisciplinary, human-centred approaches. We provide all of our code and results at: https://github.com/saeshyra/diet-chatbot-trial

Paper Structure

This paper contains 24 sections, 25 figures, 12 tables.

Figures (25)

  • Figure 1: Overview of the chatbot architecture and functional flow. The BASELINE version uses the red flow only, REPHRASED adds the step marked in blue, and FULL adds the flow marked in purple. We provide an example of the insights flow in Figure 2 and of the supportive text flow in the counselling-outputs table.
  • Figure 2: Examples of the chatbot outputs.
  • Figure 3: Initial prompt for message rephrasing.
  • Figure 4: Example of a problematic rephrased output from the initial rephrasing prompt (Figure 3), due to ambiguity in the original templated message responding to a user's request for advanced insights over a time period shorter than three days.
  • Figure 5: An example of the dynamic rephrasing. The context leading up to the intent of the templated chatbot output ("compare_no_dates" + no nutrient specified) is extracted from the NLU pipeline and dynamically added to the prompt, resulting in the rephrased output.
  • ...and 20 more figures