Controlling the World by Sleight of Hand

Sruthi Sudhakar, Ruoshi Liu, Basile Van Hoorick, Carl Vondrick, Richard Zemel

TL;DR

The paper addresses how machines can predict the object-state changes caused by actions, learning from unlabeled videos of hands interacting with objects. It presents CosHand, a diffusion-based model finetuned from a pretrained image model and conditioned on the input image, a hand mask, and a target interaction mask to synthesize the future scene. The method generalizes well to unseen objects, backgrounds, and even robot-arm embodiments, and it can be sampled repeatedly to produce multiple futures that reflect uncertainty in forces and dynamics. The work suggests that combining hand-conditioned priors with large-scale video data can yield scalable, versatile world models for robotic planning, controllable image editing, and AR/VR applications.

Abstract

Humans naturally build mental models of object interactions and dynamics, allowing them to imagine how their surroundings will change if they take a certain action. While generative models today have shown impressive results on generating/editing images unconditionally or conditioned on text, current methods do not provide the ability to perform object manipulation conditioned on actions, an important tool for world modeling and action planning. Therefore, we propose to learn action-conditional generative models from unlabeled videos of human hands interacting with objects. The vast quantity of such data on the internet allows for efficient scaling, which can enable high-performing action-conditional models. Given an image and the shape/location of a desired hand interaction, CosHand synthesizes an image of a future after the interaction has occurred. Experiments show that the resulting model can predict the effects of hand-object interactions well, with strong generalization particularly to translation, stretching, and squeezing interactions of unseen objects in unseen environments. Further, CosHand can be sampled many times to predict multiple possible effects, modeling the uncertainty of forces in the interaction/environment. Finally, our method generalizes to different embodiments, including non-human hands, i.e., robot hands, suggesting that generative video models can be powerful models for robotics.

Paper Structure

This paper contains 17 sections, 2 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: CosHand synthesizes an image of a future after a specific interaction (dotted blue mask) has occurred. In (a) we demonstrate CosHand's ability to perform complex manipulations on deformable objects such as kneading dough and opening a book. In (b) we show generalization to robot gripper interactions. In (c) we show it is possible to generate diverging futures given the same input context but different hand controls.
  • Figure 2: We show that text-conditioning is insufficient to model interactions, whereas hands allow for better control. Columns 1 & 2 show the input image, query caption, and output of text-conditional generation. Columns 3 & 4 show the input image, query hand mask, and output of CosHand. Column 5 shows the ground truth output. Notice that CosHand is able to achieve precise control (including the exact final location of the knife in row 1 and the precise squeezing motion in rows 2 & 3), which results in an output that is more consistent with the ground truth.
  • Figure 3: CosHand Method. We propose a novel approach of controlling by hands to enable manipulating objects in an image. Given an image, the corresponding hand mask, and a query hand mask of the desired interaction, CosHand synthesizes an image with the interaction applied. Such visual conditioning allows for object interaction.
  • Figure 4: In-the-wild examples (from our lab/home environment). We test CosHand on challenging in-the-wild examples collected in our home and lab environments. CosHand remains robust in these scenarios, showcasing its strong generalization ability.
  • Figure 5: We show that CosHand can perform complex manipulations on a variety of rigid and deformable objects. We show interactions such as squeezing a lemon, closing a drawer, rotating a bottle, and placing items inside cups, which require an understanding of deformable and articulated objects, as well as occlusion. In columns 1, 3 & 5 we visualize the input image and the query hand mask of the desired interaction. Columns 2, 4 & 6 portray the respective outputs of the applied hand interaction.
  • ...and 6 more figures
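
The interface described in the TL;DR and in Figure 3 (an input image, its hand mask, and a query hand mask go in; one or more candidate futures come out) can be illustrated with a small sketch. This is not the authors' implementation: the channel-wise stacking of the conditioning inputs, the function names `build_conditioning` and `sample_futures`, and the `toy_denoiser` stand-in are all assumptions made purely for illustration, and the real model is a finetuned diffusion network whose internals are not described in this summary.

```python
import numpy as np


def build_conditioning(image, hand_mask, query_mask):
    """Stack an RGB image (H, W, 3) with two binary masks (H, W) into (H, W, 5).

    Assumption: conditioning inputs are combined channel-wise. The summary
    does not state how CosHand actually fuses them.
    """
    if image.shape[:2] != hand_mask.shape or image.shape[:2] != query_mask.shape:
        raise ValueError("image and masks must share the same height and width")
    masks = np.stack([hand_mask, query_mask], axis=-1).astype(image.dtype)
    return np.concatenate([image, masks], axis=-1)


def toy_denoiser(noise, cond):
    """Hypothetical stand-in for a generative model (NOT the CosHand network).

    It perturbs the input image only inside the query hand mask, so different
    noise draws give different outputs in the region of the interaction.
    """
    image = cond[..., :3]
    query = cond[..., 4:5]
    return image * (1.0 - query) + (image + 0.1 * noise) * query


def sample_futures(cond, denoiser, num_samples, seed=0):
    """Draw several candidate futures by starting from different noise."""
    rng = np.random.default_rng(seed)
    height, width = cond.shape[:2]
    futures = []
    for _ in range(num_samples):
        noise = rng.standard_normal((height, width, 3))
        futures.append(denoiser(noise, cond))
    return np.stack(futures)


if __name__ == "__main__":
    image = np.zeros((4, 4, 3))
    hand_mask = np.zeros((4, 4))
    query_mask = np.zeros((4, 4))
    query_mask[1:3, 1:3] = 1.0  # region of the desired hand interaction

    cond = build_conditioning(image, hand_mask, query_mask)
    futures = sample_futures(cond, toy_denoiser, num_samples=3)
    print(cond.shape, futures.shape)
```

Sampling with several noise draws mirrors the idea of predicting multiple possible futures for the same input and query mask; in this toy version the samples differ only inside the query mask, which is a simplification and not a property claimed for CosHand.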