Visual Preference Inference: An Image Sequence-Based Preference Reasoning in Tabletop Object Manipulation

Joonhyung Lee; Sangbeom Park; Yongin Kwon; Jemin Lee; Minwook Ahn; Sungjoon Choi

TL;DR

The Chain-of-Visual-Residuals (CoVR) method uses a prompting mechanism that describes the differences between consecutive images (i.e., visual residuals) and combines these descriptions with the image sequence to infer the user’s preference.

Abstract

In robotic object manipulation, human preferences are often influenced by the visual attributes of objects, such as color and shape. These attributes play a crucial role when a robot must interact with objects in a way that aligns with human intention. In this paper, we focus on the problem of inferring underlying human preferences from a sequence of raw visual observations in tabletop manipulation environments with a variety of object types, which we name Visual Preference Inference (VPI). To facilitate visual reasoning in the context of manipulation, we introduce the Chain-of-Visual-Residuals (CoVR) method. CoVR employs a prompting mechanism that describes the differences between consecutive images (i.e., visual residuals) and incorporates these descriptions with the sequence of images to infer the user's preference. This approach significantly enhances the ability to understand and adapt to dynamic changes in the visual environment during manipulation tasks. Our method outperforms baseline methods at extracting human preferences from visual sequences in both simulation and real-world environments. Code and videos are available at: https://joonhyung-lee.github.io/vpi/
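
The two steps described in the abstract (describe each pair of consecutive images, then chain those descriptions with the image sequence) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `chain_of_visual_residuals`, `describe_residual`, and `infer_preference` are hypothetical names, the two callables stand in for calls to a multimodal model, and the paper's actual prompts are not reproduced here. The toy stand-ins only exist so the sketch runs without model access.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

# Any image representation works here (file path, array, bytes).
Image = TypeVar("Image")


def chain_of_visual_residuals(
    images: Sequence[Image],
    describe_residual: Callable[[Image, Image], str],
    infer_preference: Callable[[Sequence[Image], List[str]], str],
) -> Tuple[List[str], str]:
    """Describe each consecutive image pair, then infer a preference.

    Both callables are hypothetical stand-ins for multimodal model calls.
    """
    if len(images) < 2:
        raise ValueError("need at least two images to form a visual residual")

    # Step 1: one text description per pair of consecutive images.
    residuals = [
        describe_residual(before, after)
        for before, after in zip(images, images[1:])
    ]

    # Step 2: pass the descriptions together with the full image sequence.
    preference = infer_preference(images, residuals)
    return residuals, preference


# Toy stand-ins (assumptions for illustration only).
def toy_describe(before: str, after: str) -> str:
    return f"scene changed from {before} to {after}"


def toy_infer(images: Sequence[str], residuals: List[str]) -> str:
    return (
        f"preference inferred from {len(images)} images "
        f"and {len(residuals)} residuals"
    )


residuals, preference = chain_of_visual_residuals(
    ["frame0", "frame1", "frame2"], toy_describe, toy_infer
)
print(residuals)
print(preference)
```

Passing the model calls in as functions keeps the chaining logic separate from any particular model or prompt wording, so either can be swapped without touching the loop.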

Paper Structure (21 sections, 4 equations, 6 figures, 4 tables)

Introduction
Related Work
Human Preferences in Robotics
Multimodal Large Models for Robotics
Problem Formulation
Proposed Method
Visual Reasoning Descriptor
Preference Reasoning Descriptor
Experiments
Baselines & Metrics
Block Task: Spatial Pattern Preference Reasoning
Setup
Results
Polygon Task: Semantic Preference Reasoning
Setup
...and 6 more sections

Figures (6)

Figure 1: Visual Preference Inference (VPI) Tasks. We define VPI tasks as reasoning about user preferences from an image sequence. Specifically, a robot moves objects to target locations by following user instructions given via mouse clicks, which specify which object to move and where to place it.
Figure 2: Overview of Chain-of-Visual-Residuals: (a) We introduce a Visual Preference Inference (VPI) task, which extracts users' preferences solely from visual representations in tabletop manipulation environments. Our approach, CoVR prompting, involves generating (b) visual reasoning descriptions of consecutive images and (c) chaining these descriptions for interpreting human preferences from the scene sequences.
Figure 3: Examples of preferences in the simulation scenarios: (a) a spatial pattern preference, with objects arranged in a horizontal line, and (b) a semantic preference, with objects of the same shape grouped together.
Figure 4: Household objects: We test our approach on various everyday objects, some of which can be grouped by color, shape, or category.
Figure 5: This figure illustrates the application of CoVR in a scenario where objects are rearranged based on their category. The result of each visual residual shows the model's ability to identify semantic and geometric properties of objects, emphasizing the practical utility of CoVR in tasks that require a visual understanding of object properties and spatial relationships. See more videos and tasks at https://joonhyung-lee.github.io/vpi/
...and 1 more figure
