COACH meets QUORUM: A Framework and Pipeline for Aligning User, Expert and Developer Perspectives in LLM-generated Health Counselling

Yee Man Ng, Bram van Dijk, Pieter Beynen, Otto Boekesteijn, Joris Jansen, Gerard van Oortmerssen, Max van Duijn, Marco Spruit

TL;DR

This paper presents QUORUM, a new evaluation framework that unifies developer-, expert-, and user-centric perspectives for consumer health language technologies, and illustrates how such a unified framework can support trustworthy, patient-centered NLP systems in real-world settings.

Abstract

Systems that collect data on sleep, mood, and activities can provide valuable lifestyle counselling to populations affected by chronic disease and its consequences. Such systems are, however, challenging to develop; besides reliably extracting patterns from user-specific data, systems should also contextualise these patterns with validated medical knowledge to ensure the quality of counselling, and generate counselling that is relevant to a real user. We present QUORUM, a new evaluation framework that unifies these developer-, expert-, and user-centric perspectives, and show with a real case study that it meaningfully tracks convergence and divergence in stakeholder perspectives. We also present COACH, a Large Language Model-driven pipeline to generate personalised lifestyle counselling for our Healthy Chronos use case, a diary app for cancer patients and survivors. Applying our framework shows that overall, users, medical experts, and developers converge on the opinion that the generated counselling is relevant, of good quality, and reliable. However, stakeholders also diverge on the tone of the counselling, sensitivity to errors in pattern-extraction, and potential hallucinations. These findings highlight the importance of multi-stakeholder evaluation for consumer health language technologies and illustrate how a unified evaluation framework can support trustworthy, patient-centered NLP systems in real-world settings.

Paper Structure (26 sections, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Left: user interface for submitting queries. Right: generated counselling for the query 'How can I sleep better?' Blue underlining marks claims about the user data; orange underlining marks contextualisation statements, which relate the found patterns to information from the knowledge database. Claims and statements are labelled with (1), (A), etc., but these labels are hidden from the user.
  • Figure 2: Overview of the pipeline flow and inputs.
  • Figure 3: User interface of Healthy Chronos (HC).
  • Figure 4: Visualization of QUORUM outcomes. The 'Averages' graph summarises relevance, quality, and reliability; 'Relevance' and 'Quality' summarise their respective dimensions by averaging over variable scores; 'Reliability' contains rescaled variable ratios (see the evaluation framework table).
  • Figure 5: Examples of various errors in claims about data and contextualisation statements. A) overgeneralization of common-sense patterns; B) missing pattern in the user data; C) hallucination: advice is not strictly present in the chunks.
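
The aggregation described in the Figure 4 caption (averaging variable scores per dimension, and rescaling reliability ratios so they can be plotted alongside the other scores) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration rather than the authors' implementation: the function names, the variable names in the usage example, the linear rescaling, and the 1-to-5 target scale are all hypothetical.

```python
from statistics import mean


def dimension_average(variable_scores):
    """Average the variable scores that make up one dimension."""
    return mean(variable_scores.values())


def rescale_ratio(ratio, low=1.0, high=5.0):
    """Linearly map a ratio in [0, 1] onto [low, high].

    The 1-to-5 target range is an assumed rating scale, not one
    stated in the source.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    return low + ratio * (high - low)


def summarise(relevance, quality, reliability_ratios, low=1.0, high=5.0):
    """Return one summary score per dimension.

    relevance, quality: dicts mapping variable name -> score.
    reliability_ratios: dict mapping variable name -> ratio in [0, 1].
    """
    rescaled = {
        name: rescale_ratio(ratio, low, high)
        for name, ratio in reliability_ratios.items()
    }
    return {
        "relevance": dimension_average(relevance),
        "quality": dimension_average(quality),
        "reliability": dimension_average(rescaled),
    }


# Hypothetical variable names and values, for illustration only.
summary = summarise(
    relevance={"fits_query": 4, "actionable": 5},
    quality={"tone": 3, "clarity": 4},
    reliability_ratios={"claims_supported": 0.5, "statements_grounded": 1.0},
)
```

Rescaling the ratios before averaging keeps all three dimensions on one axis, which is what an 'Averages' panel would need; a different rescaling would change only `rescale_ratio`.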