DegustaBot: Zero-Shot Visual Preference Estimation for Personalized Multi-Object Rearrangement

Benjamin A. Newman, Pranay Gupta, Kris Kitani, Yonatan Bisk, Henny Admoni, Chris Paxton

TL;DR

DegustaBot tackles personalized multi-object rearrangement by inferring a user's preferences from visual context and zero-shot prompting vision-language foundation models (VLMs) to generate task plans. It formalizes the problem with a visually grounded preference history and introduces lifting functions that render object and placement information as images for VLM reasoning. Across synthetic and naturalistic table-setting data, GPT-4o with grid-marked lifting aligns best with user preferences; the authors report that 50% of the model's predictions are likely to be found acceptable by at least 20% of people, which highlights how challenging naturalistic preference learning is. The work also contributes a large naturalistic dataset and evaluation metrics that connect geometric similarity (RMSD) with subjective acceptability, showing both the practical potential and the limitations of zero-shot visual preference grounding in home robotics.

Abstract

De gustibus non est disputandum ("there is no accounting for others' tastes") is a common Latin maxim describing how many solutions in life are determined by people's personal preferences. Many household tasks, in particular, can only be considered fully successful when they account for personal preferences such as the visual aesthetic of the scene. For example, setting a table could be optimized by arranging utensils according to traditional rules of Western table setting decorum, without considering the color, shape, or material of each object, but this may not be a completely satisfying solution for a given person. Toward this end, we present DegustaBot, an algorithm for visual preference learning that solves household multi-object rearrangement tasks according to personal preference. To do this, we use internet-scale pre-trained vision-and-language foundation models (VLMs) with novel zero-shot visual prompting techniques. To evaluate our method, we collect a large dataset of naturalistic personal preferences in a simulated table-setting task, and conduct a user study in order to develop two novel metrics for determining success based on personal preference. This is a challenging problem and we find that 50% of our model's predictions are likely to be found acceptable by at least 20% of people.

Paper Structure (29 sections, 2 equations, 10 figures, 3 tables)

Figures (10)

  • Figure 1: DegustaBot takes in a single person's preferred table arrangements (shown here as the visual context $k_i$ and order of object placement $\mathcal{T}_i$), the objects $O$ from which the algorithm can select, the table to set $a_0$ and the task prompt $\ell$. The robot then produces a task plan in the form of an object arrangement (visualized as an image $a_T$) and the order of object placement $\mathcal{T}$. This predicted arrangement, Pred, should match a held-out preference created by the user, GT.
  • Figure 2: Details of a table arrangement. An arrangement is described by the objects within the arrangement, the order in which they are placed, their features (such as color, shape, and material), and their location and orientation.
  • Figure 3: Object and arrangement lifting functions, from left to right: OaL, which represents objects as language (here, a JSON representation); GoMO, grid of marked objects, which represents objects visually with referential marks overlaid on each object; UmA, Unmarked Arrangement, which lifts the arrangement into the image domain; and finally GMA, Grid-Marked Arrangement, which overlays a spatial reference grid on the continuous tabletop space.
  • Figure 4: Quantitative results for DegustaBot on simulated preferences. Left: how well each model captures the geometry of the table arrangement, as measured by RMSD. Right: each model and method's accuracy when choosing items to place in the arrangement. GPT-4o and MOGMA perform the best on both metrics.
  • Figure 5: Qualitative results. The top row shows ground-truth images from our naturalistic preference data, and the bottom row shows DegustaBot's predictions, lifted into the image domain. The left side shows three examples where DegustaBot predicts arrangements similar to the ground truth; the right side shows some failure cases.
  • ...and 5 more figures
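
The captions above mention a spatial reference grid over the tabletop (GMA) and an RMSD score for arrangement geometry. The following is a minimal Python sketch of how those two ideas might fit together, not the authors' implementation. Everything named here is an assumption made for illustration: the `cell_to_xy` and `rmsd` helpers, the letter-plus-number cell labels (such as `B3`), the 4x4 grid, the unit-sized table, and the choice to match objects by name and compare positions only (orientation is ignored).

```python
import math


def cell_to_xy(cell, table_w=1.0, table_h=1.0, n_cols=4, n_rows=4):
    """Map a hypothetical grid label such as 'B3' to the centre of that cell.

    The column is given by a letter (A = first column) and the row by a
    1-based number. The label format and grid size are assumptions.
    """
    col = ord(cell[0].upper()) - ord("A")
    row = int(cell[1:]) - 1
    if not (0 <= col < n_cols and 0 <= row < n_rows):
        raise ValueError(f"cell {cell!r} lies outside the {n_cols}x{n_rows} grid")
    x = (col + 0.5) * table_w / n_cols
    y = (row + 0.5) * table_h / n_rows
    return (x, y)


def rmsd(pred, truth):
    """Root-mean-square deviation between two arrangements.

    Both arguments map object names to (x, y) positions. Only objects
    present in both arrangements are compared (an assumption; the paper
    may handle missing or extra objects differently).
    """
    shared = sorted(set(pred) & set(truth))
    if not shared:
        raise ValueError("the arrangements share no objects")
    total = 0.0
    for name in shared:
        dx = pred[name][0] - truth[name][0]
        dy = pred[name][1] - truth[name][1]
        total += dx * dx + dy * dy
    return math.sqrt(total / len(shared))


# Hypothetical usage: a model answers with grid cells, which are converted
# to table coordinates and compared with a held-out arrangement.
predicted_cells = {"plate": "B2", "fork": "A2", "knife": "C2"}
predicted = {name: cell_to_xy(cell) for name, cell in predicted_cells.items()}
held_out = {"plate": (0.4, 0.4), "fork": (0.15, 0.4), "knife": (0.65, 0.4)}
score = rmsd(predicted, held_out)
```

A lower `score` means the predicted positions sit closer to the held-out arrangement. Snapping to cell centres limits how small the score can get, which is one trade-off of answering in grid cells rather than continuous coordinates.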