DAIC-WOZ: On the Validity of Using the Therapist's prompts in Automatic Depression Detection from Clinical Interviews

Sergio Burdisso; Ernesto Reyes-Ramírez; Esaú Villatoro-Tello; Fernando Sánchez-Vega; Pastor López-Monroy; Petr Motlicek

TL;DR

The paper asks whether using the interviewer's prompts in automatic depression detection captures genuine diagnostic cues or merely exploits dataset bias. Ablation studies with Longformer BERT and a Graph Convolutional Network (GCN) on the DAIC-WOZ corpus show that the prompts of Ellie, the virtual interviewer, can act as discriminative shortcuts: models trained on them concentrate on specific questions, asked late in the interview, about past mental health experiences. The authors show that prompts alone can yield strong performance and that a simple ensemble of prompt-based and participant-based models reaches an F1 score of 0.90, the highest reported to date on this dataset using only textual information. These findings highlight the need for bias-aware, interpretable AI in clinical interviews and urge careful evaluation when incorporating interviewer prompts into depression-detection models.

Abstract

Automatic depression detection from conversational data has gained significant interest in recent years. The DAIC-WOZ dataset, which consists of interviews conducted by a human-controlled virtual agent, has been widely used for this task. Recent studies have reported enhanced performance when incorporating the interviewer's prompts into the model. In this work, we hypothesize that this improvement might be mainly due to a bias present in these prompts, rather than to the proposed architectures and methods. Through ablation experiments and qualitative analysis, we discover that models using the interviewer's prompts learn to focus on a specific region of the interviews, where questions about past experiences with mental health issues are asked, and use them as discriminative shortcuts to detect depressed participants. In contrast, models using participant responses gather evidence from across the entire interview. Finally, to highlight the magnitude of this bias, we achieve a 0.90 F1 score by intentionally exploiting it, the highest result reported to date on this dataset using only textual information. Our findings underline the need for caution when incorporating the interviewer's prompts into models, as they may inadvertently learn to exploit targeted prompts, rather than learning to characterize the language and behavior that are genuinely indicative of the patient's mental health condition.
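
The ablation described in the abstract compares models that see only the interviewer's prompts with models that see only the participant's responses. A minimal sketch of how those two input views could be built from one transcript is given below. The `(speaker, utterance)` turn layout, the `split_by_speaker` helper, and the toy utterances are all assumptions made for illustration; they are not the authors' preprocessing code or real DAIC-WOZ data.

```python
from typing import List, Tuple

# Hypothetical turn representation: (speaker, utterance).
Turn = Tuple[str, str]


def split_by_speaker(
    turns: List[Turn], interviewer: str = "Ellie"
) -> Tuple[List[str], List[str]]:
    """Split one interview into interviewer prompts and participant responses.

    Each returned list can feed a separate model, so the contribution of
    each side of the conversation can be measured in isolation.
    """
    prompts = [text for speaker, text in turns if speaker == interviewer]
    responses = [text for speaker, text in turns if speaker != interviewer]
    return prompts, responses


# Made-up toy interview, for illustration only.
toy_interview: List[Turn] = [
    ("Ellie", "how are you doing today"),
    ("Participant", "i'm okay i guess"),
    ("Ellie", "have you been diagnosed with depression"),
    ("Participant", "yes a few years ago"),
]

prompts_only, responses_only = split_by_speaker(toy_interview)
```

If a classifier trained on `prompts_only` alone separates the classes well, that suggests the interviewer's choice of questions already leaks the label, which is the shortcut the paper warns about.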

Paper Structure (14 sections, 3 equations, 2 figures, 4 tables)

Introduction
The DAIC-WOZ Dataset
Methodology
Experiments and Results
Analysis and Discussion
Implications in Clinical Practice
Conclusions
Ethical Considerations
Limitations
Technical details
Graph Convolutional Network
Graph Convolutional Network
Longformer BERT
Implementation details

Figures (2)

Figure 1: Heatmaps illustrating the distribution of learned keywords by each model across the progression of each interview. The x-axis represents individual interviews, while the y-axis denotes the percentage of the conversation from the beginning (0%) to the end (100%). The white vertical line in each plot separates the training and evaluation splits. Finally, in the E-GCN evaluation split region, the small red rectangle marks the interview segment shown in Figure 2.
Figure 2: Illustrative segment from interview "381" in the evaluation set, highlighted in Figure 1. Conversation turns are color-coded based on the proportion of keywords present, with keywords underlined for emphasis.
