TN-Eval: Rubric and Evaluation Protocols for Measuring the Quality of Behavioral Therapy Notes

Raj Sanjay Shah, Lei Xu, Qianchu Liu, Jon Burnsky, Drew Bertagnolli, Chaitanya Shivade

TL;DR

The paper tackles the lack of standardized quality assessment for behavioral health SOAP notes and the risks of LLM-generated notes. It introduces TN-Eval, a rubric-based framework co-designed with licensed therapists to evaluate notes along three dimensions: completeness, conciseness, and faithfulness. The framework comprises a human evaluation protocol, TNH-Eval, and an automatic protocol, TNA-Eval. Using the AnnoMI dataset extended with human-written and LLM-generated notes, the authors show that TNH-Eval achieves higher inter-annotator agreement than traditional Likert ratings, and that TNA-Eval tracks human judgments on completeness and conciseness, though faithfulness remains challenging due to hallucinations. The findings reveal that LLM-generated notes can surpass human notes in structural completeness and brevity, yet rubric-based assessments still favor human notes for faithfulness. Deployment considerations emphasize HIPAA-compliant, workflow-integrated tools. Overall, the work provides open resources and a scalable framework to enable robust, fine-grained evaluation of therapy notes in real-world settings.

Abstract

Behavioral therapy notes are important for both legal compliance and patient care. Unlike progress notes in physical health, quality standards for behavioral therapy notes remain underdeveloped. To address this gap, we collaborated with licensed therapists to design a comprehensive rubric for evaluating therapy notes across key dimensions: completeness, conciseness, and faithfulness. Further, we extend a public dataset of behavioral health conversations with therapist-written notes and LLM-generated notes, and apply our evaluation framework to measure their quality. We find that: (1) A rubric-based manual evaluation protocol offers more reliable and interpretable results than traditional Likert-scale annotations. (2) LLMs can mimic human evaluators in assessing completeness and conciseness but struggle with faithfulness. (3) Therapist-written notes often lack completeness and conciseness, while LLM-generated notes contain hallucination. Surprisingly, in a blind test, therapists prefer and judge LLM-generated notes to be superior to therapist-written notes.

Paper Structure

This paper contains 40 sections, 4 figures, 12 tables.

Figures (4)

  • Figure 1: The TNH-Eval human evaluation protocol.
  • Figure 2: Rubric annotation tool. For each rubric item, a therapist reads it and annotates (1) whether the section is appropriate and (2) its importance level.
  • Figure 3: Human label distribution for TNH-Eval annotations.
  • Figure 4: Human label distribution for Likert-style annotations.