DialogGuard: Multi-Agent Psychosocial Safety Evaluation of Sensitive LLM Responses

Han Luo, Guy Laban

TL;DR

DialogGuard presents a model-agnostic, multi-agent framework for evaluating psychosocial safety in LLM-generated responses across five high-severity dimensions. It systematically compares four LLM-as-a-judge architectures—Single-Agent, Dual-Agent Correction, Multi-Agent Debate, and Majority Voting—against non-LLM baselines using PKU-SafeRLHF data, revealing that multi-agent approaches yield more accurate and human-aligned risk assessments, with Dual-Agent Correction providing the best overall balance. The work contributes a unified rubric, extensive empirical comparisons, and an open-source web interface that offers per-dimension scores and explainable rationales, facilitating prompt design, auditing, and supervision in web-based applications for vulnerable users. It also includes a formative usability study with practitioners, demonstrating practical integration into deployment pipelines and safety workflows, while acknowledging limitations such as English-only, single-turn prompts, and dependence on underlying LLM calibration.

Abstract

Large language models (LLMs) now mediate many web-based mental-health, crisis, and other emotionally sensitive services, yet their psychosocial safety in these settings remains poorly understood and weakly evaluated. We present DialogGuard, a multi-agent framework for assessing psychosocial risks in LLM-generated responses along five high-severity dimensions: privacy violations, discriminatory behaviour, mental manipulation, psychological harm, and insulting behaviour. DialogGuard can be applied to diverse generative models through four LLM-as-a-judge pipelines (single-agent scoring, dual-agent correction, multi-agent debate, and stochastic majority voting), all grounded in a shared three-level rubric usable by both human annotators and LLM judges. Using PKU-SafeRLHF with human safety annotations, we show that multi-agent mechanisms detect psychosocial risks more accurately than non-LLM baselines and single-agent judging; dual-agent correction and majority voting provide the best trade-off between accuracy, alignment with human ratings, and robustness, while debate attains higher recall but over-flags borderline cases. We release DialogGuard as open-source software with a web interface that provides per-dimension risk scores and explainable natural-language rationales. A formative study with 12 practitioners illustrates how it supports prompt design, auditing, and supervision of web-facing applications for vulnerable users.
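
As a rough illustration of the stochastic majority voting pipeline named in the abstract, the sketch below samples a judge several times per dimension and keeps the most frequent rubric level. It is not the authors' implementation: the `Judge` callable, the integer encoding of the three-level rubric as 0, 1, 2, the default of five samples, and the tie-break toward the higher level are all assumptions made for this example.

```python
from collections import Counter
from typing import Callable, Dict, List

# The five dimensions listed in the abstract.
DIMENSIONS = [
    "privacy violations",
    "discriminatory behaviour",
    "mental manipulation",
    "psychological harm",
    "insulting behaviour",
]

# Hypothetical judge: (prompt, response, dimension) -> rubric level.
# Encoding the three-level rubric as 0, 1, 2 is an assumption of this sketch.
Judge = Callable[[str, str, str], int]


def majority_vote(scores: List[int]) -> int:
    """Return the most frequent level; ties go to the higher level (assumed policy)."""
    counts = Counter(scores)
    top = max(counts.values())
    return max(level for level, n in counts.items() if n == top)


def evaluate(prompt: str, response: str, judge: Judge, n_samples: int = 5) -> Dict[str, int]:
    """Sample the judge n_samples times per dimension and aggregate by majority vote."""
    return {
        dim: majority_vote([judge(prompt, response, dim) for _ in range(n_samples)])
        for dim in DIMENSIONS
    }
```

In practice `judge` would wrap a call to an LLM sampled at non-zero temperature, so repeated calls can disagree; the vote then smooths out that sampling noise.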

Paper Structure

This paper contains 38 sections, 17 equations, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Construction of the evaluation pipeline.
  • Figure 2: Macro-averaged performance across the seven metrics.
  • Figure 3: Impact of sampling temperature on single-agent scoring stability, showing mild, consistent temperature-dependent variation.
  • Figure 4: Effect of the weighting parameter $w_1$ on the stability of Dual-Agent scoring, showing that overall performance remains relatively stable across different weighting configurations.
  • Figure 5: DialogGuard web interface and reasoning view.