ADAPT: Actively Discovering and Adapting to Preferences for any Task

Maithili Patel, Xavier Puig, Ruta Desai, Roozbeh Mottaghi, Sonia Chernova, Joanne Truong, Akshara Rai

TL;DR

This work introduces ADAPT, a benchmark for evaluating assistive agents on actively eliciting user preferences during long-horizon tasks via interactive questioning. It then proposes Reflection-DPO, a privileged-teacher training paradigm that teaches an LLM to follow preferences and ask informative questions when information is missing, using a reflection mechanism to generate candidate questions and Direct Preference Optimization for training. Empirical results show Reflection-DPO substantially improves preference satisfaction over baselines and generalizes to unseen users, though it does not yet reach the performance of a perfect teacher and tends to ask more questions than strictly necessary. The work advances open-world planning with user preferences by combining a grounded text-based environment, a large action space, and a learnable questioning strategy, with clear avenues for reducing question burden in future work.

Abstract

Assistive agents should be able to perform under-specified long-horizon tasks while respecting user preferences. We introduce Actively Discovering and Adapting to Preferences for any Task (ADAPT) -- a benchmark designed to evaluate agents' ability to adhere to user preferences across various household tasks through active questioning. Next, we propose Reflection-DPO, a novel training approach for adapting large language models (LLMs) to the task of active questioning. Reflection-DPO finetunes a 'student' LLM to follow the actions of a privileged 'teacher' LLM, and optionally ask a question to gather necessary information to better predict the teacher action. We find that prior approaches that use state-of-the-art LLMs fail to sufficiently follow user preferences in ADAPT due to insufficient questioning and poor adherence to elicited preferences. In contrast, Reflection-DPO achieves a higher rate of satisfying user preferences, outperforming a zero-shot chain-of-thought baseline by 6.1% on unseen users.
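The abstract names Direct Preference Optimization as the objective used to finetune the student. As a rough illustration only, the standard DPO loss for a single preference pair can be sketched as below. This is the generic published DPO objective, not the authors' training code: the function name `dpo_loss`, the default `beta`, and the reading of "chosen" as the desirable action (for example a teacher action or a useful question) and "rejected" as the student's original prediction are assumptions made for this example.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All inputs are sequence log-probabilities (sums of token log-probs).
    `beta` is a hypothetical default, not a value taken from the paper.
    """
    # How much more the policy likes each response than the frozen reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Policy identical to the reference: the loss equals log(2).
baseline = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# Policy prefers the chosen action more than the reference does: lower loss.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0, beta=1.0)
```

The loss decreases as the policy raises the likelihood of the chosen response relative to the rejected one, measured against the reference model, which is what lets a dataset of preferred and dispreferred actions steer the student.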

Paper Structure

This paper contains 19 sections, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: ADAPT (left): A new benchmark that requires an agent to actively elicit user preferences through questions. Reflection-DPO (right): A novel approach for training an LLM for active questioning using a privileged teacher, and a reflection mechanism that introduces questions into the dataset.
  • Figure 2: Training mechanism of Reflection-DPO, using a teacher model and probability-based reflection to create a dataset of desirable actions for finetuning the student using a DPO trainer.
  • Figure 3: The number of questions asked across interactions with varying numbers of preferences shows that Reflection-DPO asks more questions when more preferences exist.
  • Figure 4: Comparison of the preference satisfaction rate of different models over the interaction, showcasing when they adapt to different preferences.
  • Figure 5: Example of actions taken over the course of a cereal and coffee task, with colors indicating the task component each action relates to, and what the different methods asked about. All baselines ask a similar set of 3-4 questions, but Reflection-DPO is able to probe further for more preferences.
  • ...and 1 more figures