Affordance-based Robot Manipulation with Flow Matching

Fan Zhang, Michael Gienger

TL;DR

This work addresses the challenge of enabling efficient, multi-task affordance understanding and action generation for assistive robotics. It proposes a parameter-efficient prompt-tuning approach that integrates language-conditioned prompts into a frozen vision backbone to produce manipulation affordances across tasks, paired with a Flow Matching policy that deterministically morphs random waypoints into 6D robot trajectories under the guidance of affordances. Empirical results on a real-world ADLs dataset show competitive affordance accuracy and strong, stable performance for Flow Matching, including fast inference that approaches or surpasses diffusion-based methods in many settings. The findings demonstrate a practical, modular framework that unifies high-level affordance reasoning with low-level motion generation, providing a scalable path toward robust, real-time robot manipulation in daily-living scenarios.

Abstract

We present a framework for assistive robot manipulation, which focuses on two fundamental challenges: first, efficiently adapting large-scale models to downstream scene affordance understanding tasks, especially in daily living scenarios where gathering multi-task data involving humans requires strenuous effort; second, effectively learning robot action trajectories by grounding the visual affordance model. We tackle the first challenge by employing a parameter-efficient prompt tuning method that prepends learnable text prompts to the frozen vision model to predict manipulation affordances in multi-task scenarios. We then propose to learn robot action trajectories guided by affordances with a supervised flow matching method. Flow matching represents a robot visuomotor policy as a conditional process of flowing random waypoints to desired robot action trajectories. Finally, we introduce a real-world dataset with 10 tasks across Activities of Daily Living to test our framework. Our extensive evaluation highlights that the proposed prompt tuning method for learning manipulation affordance achieves competitive performance and even outperforms some other finetuning protocols across data scales, while remaining parameter-efficient. Learning multi-task robot action trajectories with flow matching leads to consistently more favorable results on several robot manipulation benchmarks than some alternative behavior cloning methods. This includes more stable training and evaluation, and noticeably faster inference, while maintaining generalization performance comparable to diffusion policy, with flow matching performing marginally better in most cases. Our framework seamlessly unifies affordance learning and action generation with flow matching for robot manipulation.
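
The idea of "flowing random waypoints to desired robot action trajectories" can be illustrated with a minimal NumPy sketch. It assumes the common straight-line probability path between a noise sample and a demonstrated trajectory, with fixed-step Euler integration at inference; the abstract does not state which path or solver the authors use, so both are assumptions. All function names are hypothetical, and `make_oracle_velocity` is a stand-in for a trained, observation-conditioned network.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 (t=0) to target x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def flow_matching_loss(velocity_fn, x0, x1, t, obs):
    """Squared error between the predicted velocity at x_t and the
    straight-line target velocity (x1 - x0)."""
    x_t = interpolate(x0, x1, t)
    pred = velocity_fn(x_t, t, obs)
    return float(np.mean((pred - (x1 - x0)) ** 2))

def euler_sample(velocity_fn, x0, obs, num_steps=10):
    """Integrate dx/dt = v(x, t, obs) from t=0 to t=1 with fixed Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt, obs)
    return x

def make_oracle_velocity(target):
    """Hypothetical stand-in for a trained network. When the data consist of a
    single target trajectory, the exact field is (target - x) / (1 - t).
    The observation argument is ignored in this toy example."""
    def velocity(x, t, obs):
        return (target - x) / (1.0 - t)
    return velocity

# Toy example: 8 waypoints, each a 6D end-effector pose (values are made up).
rng = np.random.default_rng(0)
target = np.linspace(0.0, 1.0, 8)[:, None] * np.ones((1, 6))
noise = rng.normal(size=target.shape)
velocity_fn = make_oracle_velocity(target)

loss = flow_matching_loss(velocity_fn, noise, target, t=0.3, obs=None)
trajectory = euler_sample(velocity_fn, noise, obs=None, num_steps=10)
```

In a real policy the velocity function would be a neural network conditioned on the observation (for example an affordance map) and trained by minimizing this loss over random times and noise samples. Sampling is deterministic once the initial noise is fixed, which is one reason few integration steps can suffice.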

Paper Structure

This paper contains 30 sections, 11 equations, 8 figures, 6 tables, and 2 algorithms.

Figures (8)

  • Figure 1: The proposed framework for unifying affordance map learning and action generation for robot manipulation. Given the same visual scene with different language instructions, the model first extracts instruction-relevant manipulation affordances. This is achieved through a prompt tuning method that prepends learnable text-conditioned prompts to a frozen vision foundation model. Then, a flow matching policy transforms random waypoints into the desired action trajectories, guided by task-relevant affordance maps.
  • Figure 2: Overview of prompt tuning structures used for affordance learning. (Left) For the shallow structure, text-conditioned prompts are prepended to the first vision transformer layer. (Right) For the deep structure, prompts are inserted into every vision layer. Only the prompt-related layers and the decoder are updated during training, while the vision transformer remains frozen.
  • Figure 3: Framework of flow matching policy. (a) General formulation. At each time step, flow matching takes visual observation $\bm{o}$ (e.g., state-based inputs, RGB-D images, visual affordances) as input, and outputs robot actions (e.g., 6D robot end-effector trajectories, robot joint actions, gripper actions). (b-d) Visualization of the inference process of transforming random waypoints to target actions over time from 0 (green) to 1 (purple). Red lines in (b) denote the flow paths. Vector fields are shown in (c).
  • Figure 4: t-SNE visualizations of the embeddings before the decoder. Points of the same color denote tasks with the same language prompt, which are embedded close together. The prompt tuning method can produce instruction-relevant features without updating the vision backbone parameters.
  • Figure 5: Ablation studies of prompt tuning. We investigate the effect of various design choices on affordance learning performance, including pretrained weights, decoder input, dataset size and prompt location.
  • ...and 3 more figures
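
The shallow and deep prompt structures described in the Figure 2 caption can be sketched in NumPy as follows. This is a toy illustration, not the authors' implementation: `frozen_layer` is a hypothetical stand-in for a frozen transformer block, the prompts are plain arrays rather than outputs of a text-conditioned module, and discarding the previous prompt outputs before inserting fresh prompts at each layer in the deep variant is an assumption borrowed from common visual prompt tuning practice.

```python
import numpy as np

def frozen_layer(tokens, weight):
    """Toy stand-in for a frozen transformer block. Adding the token mean is a
    crude substitute for attention so that prompts influence patch tokens."""
    mixed = tokens + tokens.mean(axis=0, keepdims=True)
    return np.tanh(mixed @ weight)

def forward_shallow(patch_tokens, prompts, weights):
    """Shallow structure: prompts are prepended once, before the first layer,
    and then propagate through all frozen layers with the patch tokens."""
    tokens = np.concatenate([prompts, patch_tokens], axis=0)
    for w in weights:
        tokens = frozen_layer(tokens, w)
    return tokens[prompts.shape[0]:]  # keep only the patch-token outputs

def forward_deep(patch_tokens, layer_prompts, weights):
    """Deep structure: a separate set of prompts is inserted at every layer.
    Assumption: prompt outputs are dropped before the next layer's prompts."""
    tokens = patch_tokens
    for prompts, w in zip(layer_prompts, weights):
        tokens = np.concatenate([prompts, tokens], axis=0)
        tokens = frozen_layer(tokens, w)
        tokens = tokens[prompts.shape[0]:]
    return tokens

# Toy sizes (made up): 5 patch tokens, 2 prompt tokens, width 4, 3 layers.
rng = np.random.default_rng(0)
dim, num_layers = 4, 3
patches = rng.normal(size=(5, dim))
weights = [0.3 * rng.normal(size=(dim, dim)) for _ in range(num_layers)]  # frozen
shallow_prompts = rng.normal(size=(2, dim))                               # trainable
deep_prompts = [rng.normal(size=(2, dim)) for _ in range(num_layers)]     # trainable

features_shallow = forward_shallow(patches, shallow_prompts, weights)
features_deep = forward_deep(patches, deep_prompts, weights)
```

In both variants only the prompt arrays (and, per the caption, the decoder) would receive gradient updates; `weights` stays fixed, which is where the parameter efficiency comes from.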