Human-Aligned Faithfulness in Toxicity Explanations of LLMs
Ramaravind K. Mothilal, Joanna Roy, Syed Ishtiaque Ahmed, Shion Guha
TL;DR
This work introduces Human-Aligned Faithfulness (HAF), a theoretically grounded, multi-dimensional framework derived from Informal Logic for evaluating LLMs' free-form toxicity explanations against the reasoning of a rational human under ideal conditions. It operationalizes HAF through six uncertainty-based metrics across a three-stage prompting pipeline (justify, uphold-reason, uphold-stance) that assess non-redundant relevance, post-hoc reliance, and the individual sufficiency and necessity of reasons. Experiments on five toxicity datasets with three Llama models (up to 70B) and an 8B Ministral model show that, while the explanations can be plausible, the models often fail to coherently justify or uphold their reasons, especially when prompted about nuanced relations between reasons and stances. The findings motivate a shift from toxicity detection alone to evaluating the reasoning behind toxicity decisions. They highlight the limitations of current LLMs in socially critical contexts and call for improved methods and more robust evaluation of toxicity explanations. The authors release open-source code and LLM-generated explanations to support adoption and replication of HAF-based evaluation.
Abstract
The discourse around toxicity and LLMs in NLP largely revolves around detection tasks. This work shifts the focus to evaluating LLMs' reasoning about toxicity, as revealed by the explanations they give to justify a stance, in order to enhance their trustworthiness in downstream tasks. Despite extensive research on explainability, existing methods cannot be straightforwardly adopted to evaluate free-form toxicity explanations because, among other challenges, they rely too heavily on input text perturbations. To address these challenges, we propose a novel, theoretically grounded, multi-dimensional criterion, Human-Aligned Faithfulness (HAF), that measures the extent to which LLMs' free-form toxicity explanations align with those of a rational human under ideal conditions. We develop six metrics, based on uncertainty quantification, to comprehensively evaluate the HAF of LLMs' toxicity explanations with no human involvement, and to highlight how "non-ideal" the explanations are. We conduct several experiments on three Llama models (of size up to 70B) and an 8B Ministral model on five diverse toxicity datasets. Our results show that while LLMs generate plausible explanations in response to simple prompts, their reasoning about toxicity breaks down when they are prompted about the nuanced relations between the complete set of reasons, the individual reasons, and their toxicity stances, resulting in inconsistent and irrelevant responses. We open-source our code at https://github.com/uofthcdslab/HAF and LLM-generated explanations at https://huggingface.co/collections/uofthcdslab/haf.
