MotIF: Motion Instruction Fine-tuning

Minyoung Hwang, Joey Hejna, Dorsa Sadigh, Yonatan Bisk

TL;DR

MotIF targets the challenge that success in robotics often depends on how a task is performed, not only on the final outcome. It introduces a trajectory-grounded visual representation, built by overlaying keypoint trajectories on the final frame, and fine-tunes vision-language models (VLMs) to perform motion-aware success detection. The MotIF-1K dataset contains 653 human and 369 robot demonstrations across 13 task categories, enabling fine-tuning with co-training to generalize across unseen motions and environments. Empirically, MotIF outperforms state-of-the-art VLMs by at least 2x in precision and by 56.1% in recall, and it supports practical uses in planning refinement, termination, and trajectory ranking. The approach provides a scalable, grounded signal for evaluating and guiding robot motion in complex, semantically rich scenes.

Abstract

While success in many robotics tasks can be determined by observing only the final state and how it differs from the initial state (e.g., whether an apple is picked up), many tasks require observing the full motion of the robot to correctly determine success. For example, brushing hair requires repeated strokes that correspond to the contours and type of hair. Prior work often uses off-the-shelf vision-language models (VLMs) as success detectors; however, when success depends on the full trajectory, VLMs struggle to make correct judgments for two reasons. First, many modern VLMs operate on single frames and therefore cannot capture changes over a full trajectory. Second, even if we provide state-of-the-art VLMs with an aggregate input of multiple frames, they still fail to detect success due to a lack of robot data. Our key idea is to fine-tune VLMs using abstract representations that capture trajectory-level information, such as the path the robot takes, by overlaying keypoint trajectories on the final image. We propose motion instruction fine-tuning (MotIF), a method that fine-tunes VLMs using these abstract representations to semantically ground the robot's behavior in the environment. To benchmark and fine-tune VLMs for robotic motion understanding, we introduce the MotIF-1K dataset, containing 653 human and 369 robot demonstrations across 13 task categories. MotIF assesses the success of robot motion given the image observation of the trajectory, the task instruction, and the motion description. Our model significantly outperforms state-of-the-art VLMs, by at least 2x in precision and by 56.1% in recall, and generalizes across unseen motions, tasks, and environments. Finally, we demonstrate practical applications of MotIF in refining and terminating robot planning, and in ranking trajectories by how well they align with task and motion descriptions. Project page: https://motif-1k.github.io
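
To make the key idea concrete, here is a minimal sketch of overlaying one keypoint trajectory on the final image. It is not the authors' implementation: the function names (`lerp_color`, `overlay_trajectory`), the nested-list image format, and the `radius` parameter are assumptions made for illustration, and the sketch colors only the tracked points rather than drawing connected line segments. The white-to-green gradient with a red end marker follows the single keypoint tracking style described in the Figure 2 caption.

```python
def lerp_color(start, end, t):
    """Linearly interpolate between two RGB colors for t in [0, 1]."""
    return tuple(round(s + (e - s) * t) for s, e in zip(start, end))


def overlay_trajectory(frame, points, radius=1):
    """Return a copy of `frame` with a keypoint path drawn on top.

    frame:  list of rows, each row a list of (r, g, b) tuples (the final image).
    points: non-empty list of (x, y) pixel positions of one tracked keypoint,
            ordered from oldest to newest.
    radius: size of the red end marker in pixels (hypothetical parameter).
    """
    white, green, red = (255, 255, 255), (0, 255, 0), (255, 0, 0)
    height, width = len(frame), len(frame[0])
    out = [list(row) for row in frame]  # copy rows so the input stays untouched

    # Color each tracked position: white at the start, green at the end.
    last = len(points) - 1
    for i, (x, y) in enumerate(points):
        if 0 <= x < width and 0 <= y < height:
            t = i / last if last > 0 else 1.0
            out[y][x] = lerp_color(white, green, t)

    # Mark the final keypoint position with a filled red circle.
    end_x, end_y = points[-1]
    for y in range(max(0, end_y - radius), min(height, end_y + radius + 1)):
        for x in range(max(0, end_x - radius), min(width, end_x + radius + 1)):
            if (x - end_x) ** 2 + (y - end_y) ** 2 <= radius ** 2:
                out[y][x] = red
    return out


# Usage: a black 8x8 "final frame" and a diagonal keypoint path.
black = (0, 0, 0)
final_frame = [[black] * 8 for _ in range(8)]
path = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
annotated = overlay_trajectory(final_frame, path)
```

The annotated image carries the whole path in a single frame, which is what lets a single-image VLM see trajectory-level information. In practice the keypoint positions would come from a point tracker and the drawing from an image library; both are outside the scope of this sketch.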

Paper Structure (15 sections, 16 figures, 11 tables)

Figures (16)

  • Figure 1: Different robotic motions for various tasks. For each task, we visualize two different motions (path 1 and 2) from real robot demonstrations, where the trajectories share the same initial and final states. Most existing success detectors ignore intermediate states and therefore cannot distinguish these motions.
  • Figure 2: Visual Motion Representations. We explore three visual motion representations: (a) single keypoint tracking, (b) optical flow, and (c-d) multi-frame storyboard. For single keypoint tracking, temporal changes are shown with a color gradient from white to green, ending with a red circle. For optical flow, we visualize the flow of all keypoints with rainbow colors. We sample $N$ keyframes for an $N$-frame storyboard.
  • Figure 3: Network Architecture. Given a visual motion representation of a robot’s trajectory and its corresponding task and motion specifications, our model outputs a binary value indicating whether the motion is correct (1) or incorrect (0).
  • Figure 4: Trajectory Visualizations. We visualize two different motions for solving the same task with the same embodiment.
  • Figure 5: Motion Diversity. 10 canonical motions are described with blue arrows and their corresponding descriptions. In the blue box, motion primitives are categorized based on the shape of the ideal path. Gray dashed arrows denote variants of the blue arrow, lying in the same category. The orange box shows motions that involve grounding in the environment, where the relationship between the robot and an instance in the environment is considered.
  • ...and 11 more figures
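
The Figure 2 caption says that $N$ keyframes are sampled for an $N$-frame storyboard but does not state how they are chosen. The sketch below assumes evenly spaced sampling that always keeps the first and last frame; this sampling rule and the function name `sample_keyframe_indices` are illustrative assumptions, not details taken from the paper.

```python
def sample_keyframe_indices(num_frames, n):
    """Pick n frame indices spread evenly over a video of num_frames frames.

    Assumption: evenly spaced sampling that includes the first and last frame.
    Returns fewer than n indices only when the video has fewer than n frames.
    """
    if num_frames <= 0 or n <= 0:
        return []
    if n == 1:
        return [num_frames - 1]  # a single keyframe: use the final frame
    if n >= num_frames:
        return list(range(num_frames))  # not enough frames: keep them all
    # Integer arithmetic avoids floating-point rounding surprises.
    return [(i * (num_frames - 1)) // (n - 1) for i in range(n)]


# Usage: choose 4 keyframes from a 10-frame trajectory video.
indices = sample_keyframe_indices(10, 4)
```

The selected frames would then be tiled into one image to form the storyboard input.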