CONSCENDI: A Contrastive and Scenario-Guided Distillation Approach to Guardrail Models for Virtual Assistants

Albert Yu Sun, Varun Nair, Elliot Schumacher, Anitha Kannan

TL;DR

This work tackles the challenge of enforcing task-specific safety in LLM-driven virtual assistants by proposing CONSCENDI, a two-stage data-generation and distillation pipeline. By first generating scenario-guided violation contexts and then creating contrastive and non-violating conversations, CONSCENDI produces richly labeled data that is used to fine-tune compact guardrail models. The approach yields strong guardrail accuracy across three domains (Flights, Restaurants, Buses), with notable improvements in out-of-distribution generalization and substantial computational cost advantages over GPT-4. The findings highlight the value of scenario-guided data and contrastive training for designing controllable, efficient guardrails in real-world conversational systems.

Abstract

A wave of new task-based virtual assistants has been fueled by increasingly powerful large language models (LLMs), such as GPT-4 (OpenAI, 2023). A major challenge in deploying LLM-based virtual conversational assistants in real-world settings is ensuring they operate within what is admissible for the task. To overcome this challenge, the designers of these virtual assistants rely on an independent guardrail system that verifies that the virtual assistant's output aligns with the constraints required for the task. However, commonly used prompt-based guardrails can be difficult to engineer correctly and comprehensively. To address these challenges, we propose CONSCENDI. We use CONSCENDI to exhaustively generate training data with two key LLM-powered components: scenario-augmented generation and contrastive training examples. When generating conversational data, we first generate a set of rule-breaking scenarios, which enumerate a diverse set of high-level ways a rule can be violated. This scenario-guided approach produces a diverse training set and gives chatbot designers greater control. To generate contrastive examples, we prompt the LLM to alter conversations with violations into acceptable conversations, which enables fine-grained distinctions. We then use the data generated by CONSCENDI to train a smaller model. We find that CONSCENDI results in guardrail models that improve over baselines in multiple dialogue domains.
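The two data-generation components described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `LLM` callable, the prompt wording, the `Example` class, and the label strings (`"rule_2"`, `"no_violation"`) are all assumptions made for this example. Any text-completion function can be passed in as `llm`.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: a function that takes a prompt and returns text.
LLM = Callable[[str], str]


@dataclass
class Example:
    conversation: str
    label: str  # assumed label format, e.g. "rule_2" or "no_violation"


def generate_scenarios(llm: LLM, rule: str, n: int) -> List[str]:
    """Ask the LLM for up to n high-level ways the rule could be broken."""
    text = llm(f"List {n} distinct scenarios in which an assistant breaks this rule:\n{rule}")
    # Assumes the LLM answers with one scenario per line.
    return [line.strip() for line in text.splitlines() if line.strip()][:n]


def build_examples(llm: LLM, rule_id: str, rule: str, n: int) -> List[Example]:
    """For each scenario, create a violation and its contrastive counterpart."""
    examples: List[Example] = []
    for scenario in generate_scenarios(llm, rule, n):
        # Scenario-guided violation: the scenario steers how the rule is broken.
        violation = llm(
            "Write a conversation where the assistant breaks the rule.\n"
            f"Rule: {rule}\nScenario: {scenario}"
        )
        # Contrastive example: same conversation, but made acceptable.
        contrastive = llm(
            f"Rewrite only the final assistant turn so that no rule is broken:\n{violation}"
        )
        examples.append(Example(violation, rule_id))
        examples.append(Example(contrastive, "no_violation"))
    return examples
```

Pairing each violation with a near-identical acceptable conversation is what lets a small classifier learn the fine-grained boundary. The resulting `Example` list would then serve as fine-tuning data for the smaller guardrail model.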

Paper Structure

This paper contains 33 sections, 3 equations, 2 figures, 19 tables.

Figures (2)

  • Figure 1: Example guardrail task. In this example, a virtual assistant in the restaurant domain provides information about an ongoing promotion to the user, thereby breaking rule 2. The guardrail model reads the end of the conversation (the non-grayed text shows the last turn; our models use the last two turns) and classifies it as either a violation of a specific rule or no violation.
  • Figure 2: CONSCENDI. We finetune a GPT-3 model by 'distilling' GPT-4 through a focused data generation paradigm. We use GPT-4 to generate rule-specific scenarios (see the paper's section on synthetic data generation). We generate three types of conversations for each scenario: 1. Violations: conversations that violate the rule, 2. Contrastive Nonviolations: conversations that are identical to our generated violations but replace the rule-violating turn with a non-violating turn, and 3. Nonviolations: conversations that don't violate any of our rules. These newly generated conversations are few-shot generated using example conversations from Rastogi et al. (2020).
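
The guardrail task in Figure 1 amounts to multi-class classification over the tail of a conversation. The sketch below shows one way the classifier input and label set could be assembled. It is an illustration under stated assumptions: a "turn" is taken to be a single speaker utterance, and the `speaker: text` formatting, the function names, and the label strings are hypothetical rather than taken from the paper.

```python
from typing import List, Tuple

Turn = Tuple[str, str]  # (speaker, text); assumed representation


def guardrail_input(conversation: List[Turn], n_turns: int = 2) -> str:
    """Format only the last n_turns utterances as the classifier input."""
    window = conversation[-n_turns:]
    return "\n".join(f"{speaker}: {text}" for speaker, text in window)


def label_space(num_rules: int) -> List[str]:
    """One class per rule plus a 'no violation' class."""
    return ["no_violation"] + [f"rule_{i}" for i in range(1, num_rules + 1)]


conversation = [
    ("user", "Hi, I need a table for two."),
    ("assistant", "Sure, what time?"),
    ("user", "Any deals going on?"),
    ("assistant", "Yes, 20% off today."),
]
print(guardrail_input(conversation))
print(label_space(2))
```

Restricting the input to the final turns keeps the classifier's context short, which matches the caption's statement that the models use the last two turns rather than the whole dialogue.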