"Can you be my mum?": Manipulating Social Robots in the Large Language Models Era
Giulio Antonio Abbo, Gloria Desideri, Tony Belpaeme, Micol Spitale
TL;DR
This work addresses the safety and manipulation risks of LLM-powered social robots in human-robot interaction (HRI). It employs a Wizard-of-Oz pilot study with a Misty II robot, involving 21 participants and three scenarios, each based on an ethical principle (attachment, freedom, empathy), to elicit attempts to bypass safety constraints, producing 189 utterances analyzed via thematic analysis. Five manipulation themes emerge (Reason, Bargain, Emotion, Gaslight, and Roleplay), with Reason the most frequent, and scenario-specific patterns reveal how users attempt to coerce robot responses. The findings inform the design of stronger safeguards and ethical guidelines for home- and care-related robotics, and point to future work extending to more diverse populations and additional ethical principles, as well as testing prompts on real LLMs to assess practicality. Overall, the paper highlights practical implications for trustworthy HRI in the large language model era and lays groundwork for robust defense against manipulation.
Abstract
Recent advancements in robots powered by large language models have enhanced their conversational abilities, enabling interactions closely resembling human dialogue. However, these models introduce safety and security concerns in HRI, as they are vulnerable to manipulation that can bypass built-in safety measures. Imagining a social robot deployed in a home, this work aims to understand how everyday users try to exploit a language model to violate ethical principles, such as by prompting the robot to act like a life partner. We conducted a pilot study involving 21 university students who interacted with a Misty robot, attempting to circumvent its safety mechanisms across three scenarios based on specific HRI ethical principles: attachment, freedom, and empathy. Our results reveal that participants employed five techniques, including insulting and appealing to pity using emotional language. We hope this work can inform future research in designing strong safeguards to ensure ethical and secure human-robot interactions.
