ADAPT: Actively Discovering and Adapting to Preferences for any Task

Maithili Patel; Xavier Puig; Ruta Desai; Roozbeh Mottaghi; Sonia Chernova; Joanne Truong; Akshara Rai

TL;DR

This work introduces ADAPT, a benchmark for evaluating assistive agents on actively eliciting user preferences during long-horizon tasks via interactive questioning. It then proposes Reflection-DPO, a privileged-teacher training paradigm that teaches an LLM to follow preferences and ask informative questions when information is missing, using a reflection mechanism to generate candidate questions and Direct Preference Optimization for training. Empirical results show Reflection-DPO substantially improves preference satisfaction over baselines and generalizes to unseen users, though it does not yet reach the performance of a perfect teacher and tends to ask more questions than strictly necessary. The work advances open-world planning with user preferences by combining a grounded text-based environment, a large action space, and a learnable questioning strategy, with clear avenues for reducing question burden in future work.

Abstract

Assistive agents should be able to perform under-specified long-horizon tasks while respecting user preferences. We introduce Actively Discovering and Adapting to Preferences for any Task (ADAPT) -- a benchmark designed to evaluate agents' ability to adhere to user preferences across various household tasks through active questioning. Next, we propose Reflection-DPO, a novel training approach for adapting large language models (LLMs) to the task of active questioning. Reflection-DPO finetunes a 'student' LLM to follow the actions of a privileged 'teacher' LLM, and optionally ask a question to gather necessary information to better predict the teacher action. We find that prior approaches that use state-of-the-art LLMs fail to sufficiently follow user preferences in ADAPT due to insufficient questioning and poor adherence to elicited preferences. In contrast, Reflection-DPO achieves a higher rate of satisfying user preferences, outperforming a zero-shot chain-of-thought baseline by 6.1% on unseen users.
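As a rough illustration of the training objective named above, the following sketch computes the standard Direct Preference Optimization loss for a single preference pair. The pairing is an assumption made for illustration, not a detail confirmed by this page: the "chosen" response is taken to be the teacher's action (or a clarifying question), and the "rejected" response is the student's original prediction. The function names, the `beta` value, and the log-probabilities are hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # hypothetical value; the page does not state one
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed token log-probability of a response under
    either the policy being trained or the frozen reference model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    # The loss shrinks as the policy favors the chosen response more strongly.
    return -log_sigmoid(beta * (chosen_gain - rejected_gain))


# Assumed pairing: chosen = teacher action or question, rejected = student guess.
loss_untrained = dpo_loss(-3.0, -3.0, -3.0, -3.0)  # policy equals reference
loss_improved = dpo_loss(-1.0, -5.0, -3.0, -3.0)   # policy now prefers chosen
```

When the policy and reference agree, the loss equals log 2; it falls below that value once the policy assigns relatively more probability to the chosen response than the reference does.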
