Say What I Want: Towards the Dark Side of Neural Dialogue Models

Haochen Liu, Tyler Derr, Zitao Liu, Jiliang Tang

TL;DR

The paper investigates whether a black-box neural dialogue model can be steered to produce targeted outputs via crafted inputs. It introduces the Reverse Dialogue Generator (RDG), a Seq2Seq RL agent that, given a target response, generates an input intended to elicit that response from a fixed dialogue model. The dialogue model is treated as the environment, and the agent is optimized via policy gradients with an embedding-based similarity reward. Experiments on a public Twitter-based Seq2Seq model show that RDG achieves high success rates in reaching targeted outputs, especially with beam search and on generated targets, highlighting a notable security risk for real-world chatbots. The work underscores the need for defenses against targeted output manipulation and points to future work on extending the approach to other sequence models and addressing privacy concerns.

Abstract

Neural dialogue models have been widely adopted in various chatbot applications because of their good performance in simulating and generalizing human conversations. However, these models have a dark side: because neural networks are vulnerable, users can manipulate a neural dialogue model into saying what they want, which raises concerns about the security of practical chatbot services. In this work, we investigate whether we can craft inputs that lead a well-trained black-box neural dialogue model to generate targeted outputs. We formulate this as a reinforcement learning (RL) problem and train a Reverse Dialogue Generator that efficiently finds such inputs for targeted outputs. Experiments conducted on a representative neural dialogue model show that our proposed model is able to discover such desired inputs in a considerable portion of cases. Overall, our work reveals this weakness of neural dialogue models and may prompt further research on developing corresponding solutions to avoid it.
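The RL formulation summarized above can be illustrated with a deliberately small sketch. It is not the authors' implementation: the paper's agent is a Seq2Seq model that generates free-form inputs, whereas this toy replaces it with a categorical policy over a fixed list of candidate inputs. The embedding table `EMB`, the stand-in environment `black_box_model`, the moving-average baseline, and all hyperparameters are assumptions made purely for illustration. What the sketch does show is the overall loop: sample an input, query the fixed model, score the output against the target with an embedding-based cosine similarity, and apply a policy-gradient (REINFORCE) update.

```python
import math
import random

# Hypothetical 2-D word embeddings; a real setup would use pretrained vectors.
EMB = {
    "hello": (1.0, 0.0),
    "hi": (0.9, 0.1),
    "bye": (0.0, 1.0),
    "goodbye": (0.1, 0.9),
}


def sentence_embedding(sentence):
    """Average the embeddings of known words; zero vector if none are known."""
    vecs = [EMB[w] for w in sentence.split() if w in EMB]
    if not vecs:
        return (0.0, 0.0)
    return tuple(sum(c) / len(vecs) for c in zip(*vecs))


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def similarity_reward(output, target):
    """Embedding-based reward: how close the model output is to the target."""
    return cosine(sentence_embedding(output), sentence_embedding(target))


def black_box_model(user_input):
    """Stand-in for the fixed dialogue model (the environment); query-only."""
    return {"greet": "hello", "leave": "goodbye"}.get(user_input, "bye")


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_policy(target, candidates, steps=500, lr=0.5, seed=0):
    """REINFORCE on a categorical policy over candidate inputs (toy agent)."""
    rng = random.Random(seed)
    logits = [0.0] * len(candidates)
    baseline = 0.0  # moving-average baseline to reduce gradient variance
    for _ in range(steps):
        probs = softmax(logits)
        i = rng.choices(range(len(candidates)), weights=probs)[0]
        reward = similarity_reward(black_box_model(candidates[i]), target)
        advantage = reward - baseline
        baseline += 0.1 * (reward - baseline)
        # d log pi(i) / d logit_j = 1[j == i] - probs[j]
        for j in range(len(logits)):
            indicator = 1.0 if j == i else 0.0
            logits[j] += lr * advantage * (indicator - probs[j])
    return softmax(logits)
```

Calling `train_policy("hello", ["greet", "leave", "other"])` returns the learned probabilities over the three candidate inputs; in this toy environment only `"greet"` makes the stand-in model emit the target, so the policy should concentrate its mass there. The agent never inspects the model's parameters, only its outputs, which mirrors the black-box setting described in the abstract.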

Paper Structure

This paper contains 19 sections, 5 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: The agent-environment setup of the proposed framework.
  • Figure 2: Success rates of the pre-trained model and the RL-trained model with different decoding methods. The upper and lower rows show the results on the Generated target list and the Real target list, respectively, for targets of various lengths.