FAIIR: Building Toward A Conversational AI Agent Assistant for Youth Mental Health Service Provision

Stephen Obadinma, Alia Lachana, Maia Norman, Jocelyn Rankin, Joanna Yu, Xiaodan Zhu, Darren Mastropaolo, Deval Pandya, Roxana Sultan, Elham Dolatabadi

TL;DR

The study tackles the shortage of youth mental health resources and the cognitive load on Crisis Responders (CRs) by developing FAIIR, a frontline assistant that uses an ensemble of domain-adapted transformer models trained on $780{,}000$ crisis conversations to assign 19 predefined issue tags. It reports strong retrospective performance (AUCROC $=0.94$, F1 $=0.64$, recall $=0.81$) and holds up in silent testing, where every metric declines by less than $2\%$. CRs agree with FAIIR in $90.9\%$ of cases, and experts align more closely with the model than with the original labels. The approach relies on Longformer-based ensembles with domain adaptation and a classification threshold of $0.25$ to balance precision and recall, and includes an explainability pipeline to extract natural keywords and visualize tag semantics. The authors present the results as evidence of real-time deployment potential, fairness across demographic subgroups, and a path toward broader AI-augmented crisis support with careful ethical safeguards and human-in-the-loop validation.

Abstract

The world's healthcare systems and mental health agencies face both a growing demand for youth mental health services and limited resources to meet it. Here, we focus on frontline crisis support, where Crisis Responders (CRs) engage in conversations for youth mental health support and assign an issue tag to each conversation. In this study, we develop FAIIR (Frontline Assistant: Issue Identification and Recommendation), an advanced tool leveraging an ensemble of domain-adapted and fine-tuned transformer models trained on a large conversational dataset comprising 780,000 conversations. The primary aim is to reduce the cognitive burden on CRs, enhance the accuracy of issue identification, and streamline post-conversation administrative tasks. We evaluate FAIIR on both retrospective and prospective conversations, emphasizing human-in-the-loop design with active CR engagement for model refinement, consensus-building, and overall assessment. Our results indicate that FAIIR achieves an average AUCROC of 94%, a sample-average F1-score of 64%, and a sample-average recall of 81% on the retrospective test set. We also demonstrate the robustness and generalizability of the FAIIR tool during the silent testing phase, with less than a 2% drop in all performance metrics. Notably, CRs' responses exhibited an overall agreement of 90.9% with FAIIR's predictions. Furthermore, expert agreement with FAIIR surpassed their agreement with the original labels. To conclude, our findings indicate that assisting with the identification of issues of relevance helps reduce the burden on CRs, ensuring that appropriate resources can be provided and that active rescues and mandatory reporting can take place in critical situations requiring immediate de-escalation.
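To make the reported metrics concrete, here is a minimal Python sketch of how multi-label tag predictions could be produced from an ensemble and scored with sample-averaged F1 and recall. It is an illustration under stated assumptions, not the authors' implementation: averaging member probabilities is an assumed combination rule, the function names are hypothetical, and the toy inputs use 4 tags instead of 19. The only value taken from the text is the $0.25$ threshold.

```python
from statistics import mean

THRESHOLD = 0.25  # classification threshold reported in the TL;DR


def ensemble_probabilities(member_probs):
    """Average per-tag probabilities across ensemble members.

    Assumption: simple mean averaging. The source does not state
    how the ensemble members are combined.
    """
    n_members = len(member_probs)
    n_tags = len(member_probs[0])
    return [sum(m[t] for m in member_probs) / n_members for t in range(n_tags)]


def predict_tags(probs, threshold=THRESHOLD):
    """Assign every tag whose probability reaches the threshold (multi-label)."""
    return [1 if p >= threshold else 0 for p in probs]


def sample_f1_recall(y_true, y_pred):
    """Sample-averaged F1 and recall.

    Each metric is computed per conversation and then averaged over
    conversations. A conversation with no true tags contributes 0.0
    to recall, and one with no true and no predicted tags contributes
    0.0 to F1 (a convention chosen for this sketch).
    """
    f1s, recalls = [], []
    for truth, pred in zip(y_true, y_pred):
        tp = sum(1 for t, p in zip(truth, pred) if t and p)
        n_true, n_pred = sum(truth), sum(pred)
        recalls.append(tp / n_true if n_true else 0.0)
        f1s.append(2 * tp / (n_true + n_pred) if (n_true + n_pred) else 0.0)
    return mean(f1s), mean(recalls)


# Toy data: two conversations, two ensemble members, four tags.
conv1 = [[0.9, 0.2, 0.1, 0.4], [0.7, 0.4, 0.1, 0.0]]
conv2 = [[0.1, 0.6, 0.3, 0.1], [0.1, 0.2, 0.1, 0.1]]
y_pred = [predict_tags(ensemble_probabilities(c)) for c in (conv1, conv2)]
y_true = [[1, 0, 0, 0], [0, 1, 1, 0]]
f1, recall = sample_f1_recall(y_true, y_pred)
```

A threshold below 0.5 admits more tags per conversation, which generally raises recall at some cost in precision; that trade-off is consistent with the reported recall (81%) exceeding the reported F1 (64%).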

Paper Structure

This paper contains 28 sections, 11 figures, and 6 tables.

Figures (11)

  • Figure 1: Dataset Statistics: (Top-Left) 703,975 youth conversations with frontline crisis responders are classified into 19 pre-defined issue tags. Multiple tags may be assigned per conversation, as relevant. (Top-Middle to Middle-Right) After each interaction, service users are invited to complete a demographic survey, gauging the conversation's helpfulness and the individual's demographics (age range, ethnicity, and identification with specific identity groups). The distribution of each demographic category of the aggregated surveys is presented. (Bottom-Left) Distribution of conversation lengths (# tokens). (Bottom-Middle) Distribution of the number of issue tags assigned per conversation across the dataset. (Bottom-Right) Distribution of priority labels assigned to conversations.
  • Figure 2: (Left) Averaged performance of the FAIIR tool in predicting all 19 issue tags on the retrospective test set (n=140,795). (Right) Averaged performance of the FAIIR tool across all issue tags on the silent-testing prospective test sets (n=84,932), evaluated using three classification thresholds. The silent-testing results are overlaid on the retrospective results, with decreases in performance highlighted in red and gains shown in a lighter shade. The AUCROC bar represents the average ability of the tool to distinguish between issue tags across all categories. The tool's best overall performance is an F1-score of 0.64 on the retrospective test set and 0.62 on the silent-testing prospective test set.
  • Figure 3: Experts' blind review results presented in a matrix format, whereby each row represents an issue tag and each column a conversation. Three reviewers assess each conversation, providing feedback on the issue tags predicted by the FAIIR tool: indicating their agreement or disagreement, and identifying missing tags, where applicable. Cells shaded in green indicate agreement between reviewer and model, while cells shaded in red represent missing tags. The letter 'A' in the cell followed by a number indicates the total number of reviewers (of three total) in agreement with model predictions. The letter 'M' in the cell followed by a number indicates the total number of reviewers who believe this issue tag was missed by the FAIIR tool.
  • Figure 4: Comparison of consensus among expert responses, FAIIR tool predictions, and original annotations from open review. Precision, recall, and F1-score measures were averaged across all issue tags and conversations. "FA: $1^\circ$" denotes full agreement on primary issue tags; "PA: $1^\circ$ Maj." denotes partial agreement on primary issue tags via majority vote; "PA: $1^\circ + 2^\circ$ Maj." denotes partial agreement on primary and secondary issue tags via majority vote; "FA: $1^\circ \geq 1$" denotes full agreement on primary issue tags via at least one vote; and "FA: $1^\circ + 2^\circ \geq 1$" denotes full agreement on primary and secondary issue tags via at least one vote. "Average" denotes the average performance across all five consensus criteria. A one-sample t-test was conducted to assess the statistical significance of the average difference between the FAIIR tool and the original annotations (identified by **). The consensus among expert responses and FAIIR predictions after updating the threshold in accordance with expert assessment is shown in the "FAIIR (UT) vs. Experts" bars.
  • Figure 5: Screenshots of the first (A) and second (B) interfaces of the survey presented to experts to evaluate each conversation.
  • ...and 6 more figures
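
The reviewer tallies described in the Figure 3 caption, and the vote-based consensus criteria named in the Figure 4 caption, can be sketched in a few lines of Python. This is a hedged illustration only: the review record layout (`agreed`, `missing`), the function names, and the placeholder tag names (`tag_a`, `tag_b`, `tag_c`) are all hypothetical, and treating "majority vote" as at least two of three reviewers and "at least one vote" as a minimum of one is one plausible reading of the captions, not a definition taken from the paper.

```python
from collections import Counter


def tally_reviews(predicted, reviews):
    """Count, per tag, how many reviewers agreed with a predicted tag
    and how many flagged a tag the model missed.

    `reviews` is a list of dicts with hypothetical keys:
      "agreed"  - predicted tags the reviewer accepted
      "missing" - tags the reviewer believes the model missed
    """
    predicted = set(predicted)
    agree, missed = Counter(), Counter()
    for review in reviews:
        agree.update(set(review["agreed"]) & predicted)
        missed.update(set(review["missing"]) - predicted)
    return agree, missed


def cell_labels(agree, missed):
    """Build matrix cell labels such as 'A3' (three reviewers agree)
    or 'M2' (two reviewers say the tag was missed)."""
    labels = {tag: f"A{n}" for tag, n in agree.items()}
    labels.update({tag: f"M{n}" for tag, n in missed.items()})
    return labels


def consensus_tags(agree, missed, min_votes):
    """Tags endorsed by at least `min_votes` reviewers.

    Assumption: a reviewer endorses a tag either by agreeing with a
    prediction or by flagging the tag as missing.
    """
    votes = agree + missed
    return {tag for tag, n in votes.items() if n >= min_votes}


# Toy example: one conversation, three reviewers, placeholder tag names.
predicted = ["tag_a", "tag_b"]
reviews = [
    {"agreed": ["tag_a", "tag_b"], "missing": []},
    {"agreed": ["tag_a"], "missing": ["tag_c"]},
    {"agreed": ["tag_a"], "missing": ["tag_c"]},
]
agree, missed = tally_reviews(predicted, reviews)
labels = cell_labels(agree, missed)
majority = consensus_tags(agree, missed, min_votes=2)
at_least_one = consensus_tags(agree, missed, min_votes=1)
```

Comparing the model's predictions against `majority` versus `at_least_one` shows why the stricter criteria yield smaller reference tag sets, which in turn changes the precision and recall computed against them.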