P-ReMIS: Pragmatic Reasoning in Mental Health and a Social Implication
Sneha Oram, Pushpak Bhattacharyya
TL;DR
This work addresses the lack of pragmatic reasoning research in mental health AI. It redefines implicature and presupposition for emotion and belief analysis and introduces the PRiMH dataset, which combines CAMS data with synthetic samples. It benchmarks four instruction-tuned LLMs with zero-shot, k-shot, and chain-of-thought prompts on three pragmatic tasks (implicature, presupposition, social pragmatics) and analyzes model behavior via rollout attention. Mistral-7B and Qwen-7B show strong reasoning capabilities, while MentaLLaMA performs at an average level and relies on surface-level strategies. The work also introduces StiPRompts to evaluate how responsibly LLMs respond to mental health stigma, and finds Claude-3.5-haiku comparatively more responsible. Limitations include the focus on MCQA prompting and the scope of the data; the authors discuss ethical considerations and call for broader, more diverse pragmatic tasks in future work.
Abstract
Although explainability and interpretability have received significant attention in artificial intelligence (AI) and natural language processing (NLP) for mental health, reasoning has not been examined in the same depth. Addressing this gap is essential to bridge NLP and mental health through interpretable and reasoning-capable AI systems. To this end, we investigate the pragmatic reasoning capability of large language models (LLMs) in the mental health domain. We introduce the PRiMH dataset and propose pragmatic reasoning tasks in mental health based on two pragmatic phenomena, implicature and presupposition. In particular, we formulate two tasks on implicature and one task on presupposition. To benchmark the dataset and the proposed tasks, we consider four models: Llama3.1, Mistral, MentaLLaMA, and Qwen. The experimental results suggest that Mistral and Qwen show substantial reasoning abilities in the domain. Subsequently, we study the behavior of MentaLLaMA on the proposed reasoning tasks using the rollout attention mechanism. In addition, we propose three StiPRompts to study the stigma around mental health with three state-of-the-art LLMs: GPT4o-mini, Deepseek-chat, and Claude-3.5-haiku. Our findings show that Claude-3.5-haiku deals with stigma more responsibly than the other two LLMs.
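For readers unfamiliar with rollout attention, the following is a minimal sketch of the commonly used formulation: average the attention heads in each layer, mix the result equally with the identity matrix to account for residual connections, and multiply the per-layer matrices from the first layer upward. The abstract does not specify the exact variant applied to MentaLLaMA, so the equal mixing weight, the function names, and the toy attention values below are assumptions made for illustration only.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def attention_rollout(layers):
    """Compute attention rollout.

    layers: one entry per layer (lowest layer first); each entry is a list
    of heads; each head is an n x n attention matrix whose rows sum to 1.
    Returns an n x n matrix estimating how much each output position
    draws on each input token across all layers.
    """
    n = len(layers[0][0])
    # Start from the identity: before any layer, each token attends to itself.
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for heads in layers:
        # Average the attention weights over heads.
        avg = [[sum(h[i][j] for h in heads) / len(heads) for j in range(n)]
               for i in range(n)]
        # Mix with the identity to model the residual connection
        # (assumed equal weighting of 0.5 each).
        mixed = [[0.5 * avg[i][j] + (0.5 if i == j else 0.0)
                  for j in range(n)] for i in range(n)]
        # Re-normalise each row so it remains a distribution over tokens.
        mixed = [[v / sum(row) for v in row] for row in mixed]
        # Propagate: this layer's matrix times the rollout of the layers below.
        rollout = matmul(mixed, rollout)
    return rollout


# Toy input (hypothetical values): 2 tokens, 2 identical layers, 1 causal head each.
toy_head = [[1.0, 0.0],
            [0.5, 0.5]]
toy_rollout = attention_rollout([[toy_head], [toy_head]])
```

In this toy case the second token's row of `toy_rollout` indicates how much of its final representation is attributed to the first token versus itself; applied to a real model, the same rows can be inspected to see which input tokens a prediction leans on.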
