Say What I Want: Towards the Dark Side of Neural Dialogue Models
Haochen Liu, Tyler Derr, Zitao Liu, Jiliang Tang
TL;DR
The paper investigates whether a black-box neural dialogue model can be steered to produce targeted outputs via crafted inputs. It introduces the Reverse Dialogue Generator (RDG), a Seq2Seq RL agent that, given a target response, generates an input intended to elicit that response from a fixed dialogue model, which is treated as the environment. The agent is trained with policy gradients using an embedding-based similarity reward. Experiments on a public Twitter-based Seq2Seq model show that RDG achieves high success rates in reaching targeted outputs, especially with beam search and on generated targets, highlighting a notable security risk for real-world chatbots. The work underscores the need for defenses against targeted output manipulation and points to future work on extending the approach to other sequence models and addressing privacy concerns.
Abstract
Neural dialogue models have been widely adopted in chatbot applications because of their good performance in simulating and generalizing human conversations. However, these models have a dark side: due to the vulnerability of neural networks, a neural dialogue model can be manipulated by users to say what they want, which raises concerns about the security of practical chatbot services. In this work, we investigate whether we can craft inputs that lead a well-trained black-box neural dialogue model to generate targeted outputs. We formulate this as a reinforcement learning (RL) problem and train a Reverse Dialogue Generator that efficiently finds such inputs for targeted outputs. Experiments conducted on a representative neural dialogue model show that our proposed model is able to discover such inputs in a considerable portion of cases. Overall, our work reveals this weakness of neural dialogue models and may prompt further research on developing corresponding solutions.
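To make the reinforcement learning formulation above concrete, the following is a minimal sketch and not the authors' implementation. It assumes a drastically simplified setting: the Seq2Seq agent is replaced by a softmax policy over three candidate inputs, the black-box dialogue model is a lookup table, and the embedding-based similarity reward is the cosine similarity of averaged toy word vectors. All names, vectors, and hyperparameters (`EMBEDDINGS`, `BLACK_BOX`, `train_policy`, the learning rate, the baseline rate) are hypothetical and chosen only for illustration.

```python
import math
import random

# Hypothetical toy word vectors, invented for this illustration.
EMBEDDINGS = {
    "hello": (1.0, 0.0, 0.1),
    "goodbye": (0.0, 1.0, 0.1),
    "welcome": (0.1, 0.0, 1.0),
}

# Hypothetical black-box dialogue model: the agent only observes input -> output.
BLACK_BOX = {
    "hi": ["hello"],
    "bye": ["goodbye"],
    "thanks": ["welcome"],
}
CANDIDATE_INPUTS = sorted(BLACK_BOX)  # the agent's (tiny) action space


def dialogue_model(user_input):
    """Query the fixed environment; its internals are never inspected."""
    return BLACK_BOX[user_input]


def sentence_embedding(tokens):
    """Average the word vectors of a list of tokens."""
    dims = len(next(iter(EMBEDDINGS.values())))
    return [sum(EMBEDDINGS[t][d] for t in tokens) / len(tokens) for d in range(dims)]


def similarity_reward(output, target):
    """Cosine similarity between the averaged embeddings of two token lists."""
    a, b = sentence_embedding(output), sentence_embedding(target)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_policy(target, steps=2000, lr=0.5, seed=0):
    """REINFORCE with a moving-average baseline over the candidate inputs."""
    rng = random.Random(seed)
    logits = [0.0] * len(CANDIDATE_INPUTS)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        # Sample an input from the current policy and query the black box.
        action = rng.choices(list(range(len(probs))), weights=probs)[0]
        reward = similarity_reward(dialogue_model(CANDIDATE_INPUTS[action]), target)
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # d/d logit_i of log pi(action) is 1[i == action] - probs[i].
        for i in range(len(logits)):
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * advantage * grad
    return softmax(logits)


def best_input(target, **kwargs):
    """Return the candidate input the trained policy prefers for this target."""
    probs = train_policy(target, **kwargs)
    return CANDIDATE_INPUTS[max(range(len(probs)), key=probs.__getitem__)]
```

Calling `best_input(["goodbye"])` trains a fresh policy for that target and returns the candidate input it prefers. The sketch only mirrors the high-level loop (sample an input, query the environment, score the output against the target, update the policy); a sequence-generating agent and a real dialogue model would replace the lookup tables.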
