Interpreting User Requests in the Context of Natural Language Standing Instructions
Nikita Moghe, Patrick Xia, Jacob Andreas, Jason Eisner, Benjamin Van Durme, Harsh Jhamtani
TL;DR
This work introduces natural language standing instructions, persistent user preferences that condition LLM-driven dialogue, and formalizes the task of selecting the relevant instructions and interpreting them into API calls. It presents NLSI, a dataset of over 2.4K dialogues across 17 domains that categorizes instruction-utterance interactions into six reasoning types, along with a suite of baselines that combine selection and interpretation using prompting, retrieval, and memory-augmented strategies. Experiments reveal substantial challenges: even when the gold standing instructions are provided (Oracle), interpretation remains far from perfect, and the best exact match (EM) across methods is 44.7%, with notable variance across reasoning types. The findings highlight the need for better retrieval, reasoning, and structured parsing techniques, as well as memory-augmented approaches, to make robust use of standing instructions in task-oriented dialogue across diverse domains.
Abstract
Users of natural language interfaces, generally powered by Large Language Models (LLMs), often must repeat their preferences each time they make a similar request. We describe an approach to LLM-based dialogue modeling in which persistent user constraints and preferences, collectively termed standing instructions, are provided as additional context for such interfaces. For example, when a user states "I'm hungry", a previously expressed preference for Persian food can be automatically added to the LLM prompt, influencing the search for relevant restaurants. We develop NLSI, a language-to-program dataset consisting of over 2.4K dialogues spanning 17 domains, where each dialogue is paired with a user profile (a set of user-specific standing instructions) and corresponding structured representations (API calls). A key challenge in NLSI is to identify which subset of the standing instructions is applicable to a given dialogue. NLSI contains diverse phenomena, from simple preferences to interdependent instructions such as triggering a hotel search whenever the user is booking tickets to an event. We conduct experiments on NLSI using prompting with large language models and various retrieval approaches, achieving a maximum of 44.7% exact match on API prediction. Our results demonstrate the challenges in identifying the relevant standing instructions and interpreting them into API calls.
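To make the task concrete, the two steps described above (selecting the applicable standing instructions, then interpreting them into API calls) can be sketched as a toy pipeline. This is only an illustration under stated assumptions, not the paper's method: the `StandingInstruction` fields, the keyword table `DOMAIN_TRIGGERS`, the `Find<Domain>(...)` call format, and the set-equality version of exact match are all hypothetical, and a keyword lookup stands in for the LLM prompting and retrieval approaches the paper actually evaluates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandingInstruction:
    """One persistent user preference (field names are hypothetical)."""
    domain: str  # e.g. "Restaurants"
    slot: str    # e.g. "cuisine"
    value: str   # e.g. "Persian"
    text: str    # the natural language form stored in the user profile


# Hypothetical trigger words; a real system would use an LLM or a retriever.
DOMAIN_TRIGGERS = {
    "Restaurants": {"hungry", "eat", "dinner"},
    "Hotels": {"hotel", "stay"},
}

# A toy user profile: the set of user-specific standing instructions.
PROFILE = [
    StandingInstruction("Restaurants", "cuisine", "Persian",
                        "If I ask for food, I prefer Persian."),
    StandingInstruction("Hotels", "rating", "4",
                        "I only stay in hotels rated 4 or higher."),
]


def select_instructions(utterance, profile):
    """Selection step: keep instructions whose domain the utterance triggers."""
    words = set(utterance.lower().replace(".", " ").replace(",", " ").split())
    return [si for si in profile
            if words & DOMAIN_TRIGGERS.get(si.domain, set())]


def interpret(selected):
    """Interpretation step: turn selected instructions into API call strings."""
    by_domain = {}
    for si in selected:
        by_domain.setdefault(si.domain, {})[si.slot] = si.value
    calls = []
    for domain, slots in sorted(by_domain.items()):
        args = ", ".join(f'{k}="{v}"' for k, v in sorted(slots.items()))
        calls.append(f"Find{domain}({args})")
    return calls


def exact_match(predicted, gold):
    """Simplified exact match: the predicted and gold call sets are identical."""
    return set(predicted) == set(gold)


if __name__ == "__main__":
    calls = interpret(select_instructions("I'm hungry", PROFILE))
    print(calls)
    print(exact_match(calls, ['FindRestaurants(cuisine="Persian")']))
```

In this sketch, the utterance "I'm hungry" selects only the restaurant preference and leaves the hotel instruction unused, mirroring the challenge of identifying which subset of a profile applies to a given dialogue.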
