How to Peel with a Knife: Aligning Fine-Grained Manipulation with Human Preference

Toru Lin, Shuying Deng, Zhao-Heng Yin, Pieter Abbeel, Jitendra Malik

TL;DR

This work presents a learning framework for essential manipulation tasks, using peeling with a knife as a representative example. Policies trained on a single produce category generalize zero-shot to unseen in-category instances and to out-of-distribution produce from different categories while maintaining over 90% success rates.

Abstract

Many essential manipulation tasks - such as food preparation, surgery, and craftsmanship - remain intractable for autonomous robots. These tasks are characterized not only by contact-rich, force-sensitive dynamics, but also by their "implicit" success criteria: unlike pick-and-place, task quality in these domains is continuous and subjective (e.g. how well a potato is peeled), making quantitative evaluation and reward engineering difficult. We present a learning framework for such tasks, using peeling with a knife as a representative example. Our approach follows a two-stage pipeline: first, we learn a robust initial policy via force-aware data collection and imitation learning, enabling generalization across object variations; second, we refine the policy through preference-based finetuning using a learned reward model that combines quantitative task metrics with qualitative human feedback, aligning policy behavior with human notions of task quality. Using only 50-200 peeling trajectories, our system achieves over 90% average success rates on challenging produce including cucumbers, apples, and potatoes, with performance improving by up to 40% through preference-based finetuning. Remarkably, policies trained on a single produce category exhibit strong zero-shot generalization to unseen in-category instances and to out-of-distribution produce from different categories while maintaining over 90% success rates.

Paper Structure

This paper contains 19 sections, 10 equations, 7 figures, and 11 tables.

Figures (7)

  • Figure 1: An overview of our system setup and learned peeling policies. We use a 7-DoF Kinova Gen3 arm with impedance control. A custom-designed mount holding a knife is attached to the tool end. Two wrist cameras are attached to the tool end and point towards the knife and produce. We collect data on three types of produce, train peeling policies that zero-shot generalize to six types of produce with a wide range of geometries and surface physical properties, and finetune the policies to align with human preferences for peel quality.
  • Figure 2: An overview of our two-stage learning framework. This includes details on data and model architecture for compliant data collection, force-aware imitation learning, and preference-based finetuning from a learned reward model.
  • Figure 3: End-effector mount. A CAD visualization of our custom end-effector mount design, including an arm connector, a force-torque sensor plate, two camera mounts, and a knife mount.
  • Figure 4: Front-view visualization of qualitative score metric. We use integer scores from 0 to 9 (the higher the better) to capture subjective human preferences based on the overall visual appearance of the peel.
  • Figure 5: Wrist-view visualization of quantitative score metric. We use six discrete thickness categories, where nominal denotes the most desired thickness. Details of how these categories are mapped to normalized scalar rewards can be found in Appendix \ref{app:reward}.
  • ...and 2 more figures
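
The captions for Figures 4 and 5 describe two label types: an integer qualitative score from 0 to 9 and one of six discrete thickness categories that are mapped to normalized scalar rewards. The sketch below illustrates how such labels could be turned into a single scalar in [0, 1]. It is not the paper's mapping, which is given in its appendix. Only the category name "nominal" comes from the source. The other five category names, all reward values, the `weight` parameter, and the linear blend are assumptions made for illustration.

```python
# Hypothetical thickness categories and reward values. Only "nominal"
# (the most desired thickness) is named in the source; the rest are
# placeholders chosen for this example.
THICKNESS_REWARD = {
    "no_cut": 0.0,
    "too_thin": 0.4,
    "slightly_thin": 0.7,
    "nominal": 1.0,
    "slightly_thick": 0.6,
    "too_thick": 0.2,
}


def qualitative_reward(score: int) -> float:
    """Normalize an integer human score in 0..9 (higher is better) to [0, 1]."""
    if not 0 <= score <= 9:
        raise ValueError(f"score must be in 0..9, got {score}")
    return score / 9.0


def combined_reward(thickness: str, score: int, weight: float = 0.5) -> float:
    """Blend the quantitative and qualitative terms linearly.

    `weight` is an assumed mixing coefficient for the thickness term;
    the paper learns a reward model rather than using a fixed blend.
    """
    if thickness not in THICKNESS_REWARD:
        raise ValueError(f"unknown thickness category: {thickness}")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    quantitative = THICKNESS_REWARD[thickness]
    return weight * quantitative + (1.0 - weight) * qualitative_reward(score)


if __name__ == "__main__":
    # A nominal-thickness peel with a top visual score gets the maximum reward.
    print(combined_reward("nominal", 9))
    # A peel that is too thick and visually poor is scored much lower.
    print(combined_reward("too_thick", 2))
```

A fixed linear blend is the simplest possible stand-in. In the paper's framework the reward is learned from these signals, so the function above is best read as a description of the label format and not of the reward model itself.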