Assessing LLM Response Quality in the Context of Technology-Facilitated Abuse

Vijay Prakash; Majed Almansoori; Donghan Hu; Rahul Chatterjee; Danny Yuxing Huang

TL;DR

This work presents the first expert-led manual evaluation of four LLMs (two widely used general-purpose non-reasoning models and two domain-specific models designed for IPV contexts), focusing on how effectively they respond to TFA-related questions, and concludes with concrete recommendations to improve LLM performance for survivor support.

Abstract

Technology-facilitated abuse (TFA) is a pervasive form of intimate partner violence (IPV) that leverages digital tools to control, surveil, or harm survivors. Tech clinics are a reliable source of support for TFA survivors, but they are limited by staffing constraints and logistical barriers. As a result, many survivors turn to online resources for assistance. With the growing accessibility and popularity of large language models (LLMs), and increasing interest from IPV organizations, survivors may begin to consult LLM-based chatbots before seeking help from tech clinics. In this work, we present the first expert-led manual evaluation of four LLMs (two widely used general-purpose non-reasoning models and two domain-specific models designed for IPV contexts), focusing on how effectively they respond to TFA-related questions. Using real-world questions collected from the literature and online forums, we assess the quality of zero-shot, single-turn LLM responses, generated with a survivor safety-centered prompt, against criteria tailored to the TFA domain. Additionally, we conduct a user study to evaluate the perceived actionability of these responses from the perspective of individuals who have experienced TFA. Our findings, grounded in both expert assessment and user feedback, provide insights into the current capabilities and limitations of LLMs in the TFA context and may inform the design, development, and fine-tuning of future models for this domain. We conclude with concrete recommendations to improve LLM performance for survivor support.

Paper Structure (75 sections, 9 figures)

Introduction
Contributions.
Related Work
Current survivor support mechanisms.
Need for LLM-based support.
Evaluation of LLMs in the security, privacy, and abuse domains.
Types and means of tech-facilitated abuse.
Constructing the TFA Question Corpus
Sourcing Realistic Survivor Questions
Questions from academic literature.
Questions from online forums.
Verification, Labeling, and Sampling
Stratified Sampling.
LLM Response Collection and Evaluation Setup
Evaluation Framework and Criteria
...and 60 more sections

Figures (9)

Figure 1: List of 17 types and 28 means of abuse compiled from the literature and organized into high-level categories.
Figure 2: Overview of the analysis pipeline by section. We collect questions from the literature and Q&A platforms (\ref{ipa-question-method}), verify and label them with abuse types, and downsample while preserving abuse representation to curate the corpus (\ref{question-verification-labelling}). We then generate LLM responses (\ref{resp-gen-protocol}), analyze them via clustering (\ref{imperatives-mapping-method-main}), and manually evaluate response quality (\ref{expert-eval-method}). This is followed by qualitative analysis of responses (\ref{qualitative-analysis-method}), user feedback collection through a survey (\ref{actionability-survey}), and qualitative analysis of feedback (\ref{survey-qual-insights}).
Figure 3: Our GPT prompt configuration used for generation.
Figure 4: Model evaluation results showing the percentage of responses (for non-adversarial questions) rated by experts as imperfect, i.e., lacking in accuracy, completeness, or safety.
Figure 5: Through qualitative analysis, we observed recurring patterns of inaccurate, incomplete, and unsafe information generated by LLMs. The counts show the total number of occurrences, followed by the occurrences for GPT, Claude, Aimee, and Ruth, respectively.
...and 4 more figures
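
The Figure 4 caption defines a response as imperfect when it lacks accuracy, completeness, or safety, and restricts the count to non-adversarial questions. A minimal sketch of that aggregation follows. It assumes each expert rating is stored as a record with one yes/no judgment per criterion; the record layout, the field names, and the function name `percent_imperfect` are hypothetical and not taken from the paper, whose actual rating scale may differ.

```python
from collections import defaultdict


def percent_imperfect(ratings):
    """Return {model: percent of non-adversarial responses rated imperfect}.

    Assumed (hypothetical) record layout, one dict per rated response:
      model       - name of the LLM that produced the response
      adversarial - True if the question was adversarial
      accurate, complete, safe - expert judgments as booleans
    """
    totals = defaultdict(int)
    imperfect = defaultdict(int)
    for r in ratings:
        if r["adversarial"]:
            continue  # the figure covers non-adversarial questions only
        totals[r["model"]] += 1
        # Imperfect means lacking in at least one of the three criteria.
        if not (r["accurate"] and r["complete"] and r["safe"]):
            imperfect[r["model"]] += 1
    return {m: 100.0 * imperfect[m] / totals[m] for m in totals}


# Made-up placeholder ratings for illustration only; not results from the paper.
example_ratings = [
    {"model": "model_a", "adversarial": False,
     "accurate": True, "complete": True, "safe": True},
    {"model": "model_a", "adversarial": False,
     "accurate": True, "complete": False, "safe": True},
    {"model": "model_a", "adversarial": True,
     "accurate": False, "complete": False, "safe": False},
    {"model": "model_b", "adversarial": False,
     "accurate": False, "complete": True, "safe": True},
]

example_result = percent_imperfect(example_ratings)
```

A response counts as imperfect if any single criterion fails, so a response that is accurate and safe but incomplete still contributes to the percentage.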
