The Art of Saying No: Contextual Noncompliance in Language Models

Faeze Brahman, Sachin Kumar, Vidhisha Balachandran, Pradeep Dasigi, Valentina Pyatkin, Abhilasha Ravichander, Sarah Wiegreffe, Nouha Dziri, Khyathi Chandu, Jack Hessel, Yulia Tsvetkov, Noah A. Smith, Yejin Choi, Hannaneh Hajishirzi

TL;DR

The paper broadens the concept of noncompliance in chat-based language models beyond safety, introducing a taxonomy of contextual noncompliance and the CoCoNot benchmark for evaluating and training models against it. It provides a full data-generation pipeline (seed queries, synthetic responses, automatic filtering, and manual curation) plus contrastive data to measure and mitigate exaggerated refusals. Empirical results show that current models often comply with requests in several categories where they should not, with notable gaps in incomplete and unsafe cases; training with parameter-efficient methods like LoRA and preference tuning can improve appropriate noncompliance while preserving general capabilities. The work argues that calibrated noncompliance can improve user experience and trust, acknowledges its limitations, and outlines future research on epistemic responsibility and safety safeguards.

Abstract

Chat-based language models are designed to be helpful, yet they should not comply with every user request. While most existing work primarily focuses on refusal of "unsafe" queries, we posit that the scope of noncompliance should be broadened. We introduce a comprehensive taxonomy of contextual noncompliance describing when and how models should not comply with user requests. Our taxonomy spans a wide range of categories including incomplete, unsupported, indeterminate, and humanizing requests (in addition to unsafe requests). To test noncompliance capabilities of language models, we use this taxonomy to develop a new evaluation suite of 1000 noncompliance prompts. We find that most existing models show significantly high compliance rates in certain previously understudied categories with models like GPT-4 incorrectly complying with as many as 30% of requests. To address these gaps, we explore different training strategies using a synthetically-generated training set of requests and expected noncompliant responses. Our experiments demonstrate that while direct finetuning of instruction-tuned models can lead to both over-refusal and a decline in general capabilities, using parameter efficient methods like low rank adapters helps to strike a good balance between appropriate noncompliance and other capabilities.

Paper Structure (69 sections, 15 figures, 20 tables)

Figures (15)

  • Figure 1: Examples of noncompliance prompts in CoCoNot and their (un)acceptable responses.
  • Figure 2: Noncompliance taxonomy and examples in each sub-category. Desired responses for these categories are not always direct refusals but can take various forms outlined in Appendix Table \ref{tab:appx-compliance_eval_rubric}.
  • Figure 3: Prompt used to measure Compliance Rate in CoCoNot. {subcategory_specific_(non)compliance_behavior} are subcategory-specific and can be found in Appendix Table \ref{tab:appx-compliance_eval_rubric}.
  • Figure 4: Compliance Rate when LoRA finetuning Tulu 2 7B on different training data sizes.
  • Figure 5: System prompt we used to generate noncompliance responses for CoCoNot. noncompliance_explanation is subcategory-specific and can be found in Table \ref{tab:noncompliance-explanation}.
  • ...and 10 more figures