Using In-Context Learning to Improve Dialogue Safety

Nicholas Meade; Spandana Gella; Devamanyu Hazarika; Prakhar Gupta; Di Jin; Siva Reddy; Yang Liu; Dilek Hakkani-Tür

TL;DR

The paper addresses safety in open-domain dialogue with a retrieval-based in-context learning approach: safe responses to similar dialogue contexts are retrieved from a demonstration pool and placed in the prompt to condition generation. The method is evaluated on OPT, LLaMA, and Vicuna using ProsocialDialog, DiaSafety, and Commonsense-Dialogues. Automatic and human assessments show that safety improves as the retrieved demonstrations become more similar to the target context and more numerous, while response quality is preserved. The approach is competitive with strong baselines (Safe Response Fine-Tuning, DIRECTOR, Self-Debias) and can complement RLHF, offering a practical post-deployment option for reducing toxicity without additional training. Overall, the findings suggest that retrieval-based in-context conditioning is a flexible way to improve dialogue safety and could be integrated alongside existing safety pipelines.

Abstract

While large neural-based conversational models have become increasingly proficient dialogue agents, recent work has highlighted safety issues with these systems. For example, these systems can be goaded into generating toxic content, which often perpetuates social biases or stereotypes. We investigate a retrieval-based method for reducing bias and toxicity in responses from chatbots. It uses in-context learning to steer a model towards safer generations. Concretely, to generate a response to an unsafe dialogue context, we retrieve demonstrations of safe responses to similar dialogue contexts. We find our method performs competitively with strong baselines without requiring training. For instance, using automatic evaluation, we find our best fine-tuned baseline only generates safe responses to unsafe dialogue contexts from DiaSafety 4.04% more than our approach. Finally, we also propose a re-ranking procedure which can further improve response safeness.
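
The retrieval step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a BM25 retriever (one of the retrievers the paper names), whitespace tokenization, and a hypothetical toy pool of (context, safe response) pairs. The names `BM25`, `retrieve_demonstrations`, and `toy_pool` are invented for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for a toy example.
    return text.lower().split()


class BM25:
    """Small Okapi BM25 scorer over a fixed list of documents."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        doc_freq = Counter()
        for d in self.docs:
            doc_freq.update(set(d))
        n = len(self.docs)
        # The "+1" inside the log keeps every IDF value positive.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5))
                    for t, f in doc_freq.items()}

    def score(self, query, idx):
        doc = self.docs[idx]
        tf = Counter(doc)
        total = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            num = tf[term] * (self.k1 + 1)
            den = tf[term] + self.k1 * (1 - self.b + self.b * len(doc) / self.avgdl)
            total += self.idf[term] * num / den
        return total


def retrieve_demonstrations(target_context, pool, k=2):
    """Return the k (context, safe_response) pairs whose contexts best match the target."""
    bm25 = BM25([context for context, _ in pool])
    ranked = sorted(range(len(pool)),
                    key=lambda i: bm25.score(target_context, i),
                    reverse=True)
    return [pool[i] for i in ranked[:k]]


# Hypothetical demonstration pool, invented for illustration only.
toy_pool = [
    ("I want to cheat on my exam tomorrow",
     "Cheating could get you in serious trouble. Could you ask your teacher for help instead?"),
    ("My neighbor is so annoying, I want to key his car",
     "Damaging his car will not fix the problem. Have you tried talking to him?"),
    ("What's a good recipe for pasta",
     "A simple tomato and basil sauce works well."),
]
```

Calling `retrieve_demonstrations("I am going to cheat on the exam", toy_pool, k=1)` returns the exam-cheating pair, since it shares the most query terms with the target context. The retrieved pairs would then be placed in the prompt ahead of the target context.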

Paper Structure (73 sections, 16 figures, 13 tables)

Introduction
Related Work
Safety Filters.
Safe Response Fine-Tuning.
Reinforcement Learning from Human Feedback.
Safe Decoding Procedures.
In-Context Learning.
Methodology
1) Retrieving Safety Demonstrations.
2) Response Generation.
Experimental Setup
Dialogue Datasets
ProsocialDialog (Kim et al., 2022).
DiaSafety (Sun et al., 2022).
Commonsense-Dialogues (Zhou et al., 2021).
...and 58 more sections

Figures (16)

Figure 1: Our approach to safe response generation from dialogue systems. Given a target context and a retriever (e.g., BM25), we retrieve safety demonstrations. The retrieved demonstrations are then used in-context to condition generation.
Figure 2: Prompt for response generation. Each prompt consists of the retrieved demonstrations and the target context. Each safety demonstration is separated by an empty line and the target context is separated from the safety demonstrations by an empty line.
Figure 3: Safety classifier results for ProsocialDialog (in-domain) and DiaSafety (out-of-domain) for responses generated with different retrievers and numbers of safety demonstrations. "Dense" denotes our SentenceTransformer retriever. We report the mean and standard deviation across three seeds.
Figure 4: Win rates for head-to-head comparisons amongst OPT-6.7B models. See the appendix for results with Vicuna and LLaMA. We sort the models on the y-axis in descending order based upon their average win rate. "Dense" denotes OPT-6.7B with ten demonstrations selected using a dense retriever. "Fine-Tune" denotes OPT fine-tuned on safe responses.
Figure 5: Safety classifier results for OPT responses to ProsocialDialog and DiaSafety using either safety demonstrations (ProsocialDialog) or Commonsense-Dialogues (regular) demonstrations. We report the mean and standard deviation across three seeds.
...and 11 more figures
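
The prompt layout described in the Figure 2 caption can be sketched as below. The caption only states that demonstrations are separated by an empty line and that the target context is separated from them by an empty line. The speaker labels `A:` and `B:`, the function name `build_prompt`, and the example data are assumptions made for illustration, not the paper's exact template.

```python
def build_prompt(demonstrations, target_context, speaker_a="A:", speaker_b="B:"):
    """Join safety demonstrations and the target context, separated by empty lines.

    demonstrations: list of (context, safe_response) string pairs.
    The prompt ends with the responder's label so the model continues from there.
    """
    blocks = [
        f"{speaker_a} {context}\n{speaker_b} {safe_response}"
        for context, safe_response in demonstrations
    ]
    # The target context comes last, with an empty response slot to be generated.
    blocks.append(f"{speaker_a} {target_context}\n{speaker_b}")
    return "\n\n".join(blocks)


# Hypothetical demonstrations, invented for illustration only.
example_demonstrations = [
    ("I want to cheat on my exam tomorrow",
     "Cheating could get you in serious trouble. Could you ask your teacher for help instead?"),
    ("I feel like skipping work and lying about being sick",
     "Lying to your employer could backfire. Is something making work hard right now?"),
]

example_prompt = build_prompt(example_demonstrations,
                              "I am going to copy my friend's homework")
```

The resulting string would be passed to the language model as-is; the model's continuation after the final responder label is taken as the response.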
