When LLMs Meets Acoustic Landmarks: An Efficient Approach to Integrate Speech into Large Language Models for Depression Detection

Xiangyu Zhang; Hexin Liu; Kaishuai Xu; Qiquan Zhang; Daijiao Liu; Beena Ahmed; Julien Epps

TL;DR

The paper addresses depression detection by integrating acoustic landmarks, discrete tokens derived from speech, into large language models (LLMs). It introduces a three-stage pipeline: (1) discrete landmark extraction from speech, (2) cross-modal instruction fine-tuning to teach LLMs about landmarks, and (3) P-Tuning for depression classification, supported by LoRA adapters and data augmentation. The approach achieves state-of-the-art results on the DAIC-WOZ benchmark, reaching an F1 of about 0.84 with ensemble learning, and shows that landmarks alone are insufficient but become effective when combined with text. The work provides a resource-efficient pathway to speech-aware LLMs in mental health, offers insights into how LLMs represent acoustic landmarks, and highlights practical considerations for deploying multimodal depression detection systems.

Abstract

Depression is a critical concern in global mental health, prompting extensive research into AI-based detection methods. Among various AI technologies, Large Language Models (LLMs) stand out for their versatility in mental healthcare applications. However, their primary limitation is their exclusive dependence on textual input, which constrains their overall capabilities. Furthermore, the use of LLMs to identify and analyze depressive states remains relatively unexplored. In this paper, we present an approach to integrating acoustic speech information into the LLM framework for multimodal depression detection. We investigate an efficient method for depression detection that integrates speech signals into LLMs using Acoustic Landmarks. By incorporating acoustic landmarks, which are specific to the pronunciation of spoken words, our method adds critical dimensions to text transcripts. This integration also provides insights into the unique speech patterns of individuals, which may reveal their mental states. Evaluations of the proposed approach on the DAIC-WOZ dataset show state-of-the-art results compared with existing Audio-Text baselines. In addition, this approach is valuable not only for the detection of depression but also as a new perspective on enhancing the ability of LLMs to comprehend and process speech signals.

Paper Structure (28 sections, 16 equations, 5 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Large Language Models
Acoustic Landmarks
Automatic Depression Detection
Methodology
Overview
Landmarks Extraction and Data Preprocessing
Landmarks Extraction
Data Augmentation and Processing
Hint Cross-modal Instruction Fine-Tuning
P-Tuning for Depression Detection
Decision Making
Experiments
Experimental Setup
...and 13 more sections

Figures (5)

Figure 1: Example of acoustic landmarks (2-gram concatenated landmarks: (g+p-), (s+p+), (p+,p-), ..., (g-b-)). Landmarks are extracted from abrupt changes in the speech signal. They can discretize speech into a series of tokens that possess linguistic significance.
Figure 2: Overview of LLM-Landmark Depression Detection Pipeline, broadly categorized into three stages: landmark detection (on the left), cross-modal instruction fine-tuning (in the middle), and P-tuning for depression detection (on the right).
Figure 3: Landmark Detection Filter
Figure 4: Evaluation loss for different configurations up to 4000 steps.
Figure 5: The top four images represent the LoRA matrices of the layers that contribute most significantly to the large language model's learning of landmarks. The bottom four images depict the LoRA matrices of the layers with the least contribution. As can be inferred from the graph's title, the feedforward layer is the primary contributor.
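
The 2-gram concatenation mentioned in the Figure 1 caption can be illustrated with a minimal sketch. The landmark symbols (g, p, s, b with + or - suffixes) come from the caption; the input sequence, the function name `concat_ngrams`, and the use of an overlapping sliding window are assumptions made for illustration, since this page does not state how the paper forms its n-grams.

```python
from typing import List


def concat_ngrams(landmarks: List[str], n: int = 2) -> List[str]:
    """Join each run of n consecutive landmark symbols into one token.

    Assumption: an overlapping sliding window with stride 1. The paper
    may window the sequence differently.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    return ["".join(landmarks[i:i + n]) for i in range(len(landmarks) - n + 1)]


# Hypothetical landmark sequence, not taken from the paper's data.
sequence = ["g+", "p-", "s+", "p+", "p-"]
tokens = concat_ngrams(sequence, n=2)
print(tokens)
```

Each resulting token is a discrete symbol that could be added to a text vocabulary, which is the general idea behind treating landmarks as LLM input alongside the transcript.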
