CoA-VLA: Improving Vision-Language-Action Models via Visual-Textual Chain-of-Affordance

Jinming Li, Yichen Zhu, Zhibin Tang, Junjie Wen, Minjie Zhu, Xiaoyu Liu, Chengmeng Li, Ran Cheng, Yaxin Peng, Yan Peng, Feifei Feng

TL;DR

CoA-VLA introduces a structured chain-of-affordance for Vision-Language-Action models, enabling sequential object, grasp, spatial, and movement reasoning to ground action in physical context. It integrates textual and visual affordances through a visual-language co-injection module with FiLM conditioning, and generates large-scale affordance data via a GPT-4o–driven pipeline with grounding and tracking tools. Empirically, CoA-VLA surpasses state-of-the-art baselines on real-robot tasks and the LIBERO simulation benchmark, while showing strong generalization to unseen poses and obstacle-rich environments. The approach achieves higher robustness and efficiency by adopting dynamic affordance selection, demonstrating the practical impact of explicit, affordance-aware reasoning for scalable, generalizable robotic manipulation.
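The FiLM conditioning mentioned above can be illustrated with a minimal sketch. FiLM (feature-wise linear modulation) scales and shifts each feature of a hidden vector using parameters predicted from a conditioning vector. The code below is a generic, dependency-free illustration and not the CoA-VLA implementation: the function names (`linear`, `film`), the weight shapes, and the idea that the conditioning vector is a fused affordance embedding are all assumptions made for this example.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(weights: Matrix, bias: Vector, x: Vector) -> Vector:
    """Affine map: weights @ x + bias, written with plain lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film(features: Vector, cond: Vector,
         w_gamma: Matrix, b_gamma: Vector,
         w_beta: Matrix, b_beta: Vector) -> Vector:
    """Feature-wise linear modulation: gamma(cond) * features + beta(cond).

    `cond` stands in for a hypothetical affordance embedding; `features`
    stands in for a hidden vector of the policy network (both assumed).
    """
    gamma = linear(w_gamma, b_gamma, cond)  # per-feature scale
    beta = linear(w_beta, b_beta, cond)     # per-feature shift
    return [g * f + b for g, f, b in zip(gamma, features, beta)]


# Toy usage with hand-picked weights.
hidden = [1.0, 2.0]
affordance_embedding = [1.0, 0.0]
out = film(hidden, affordance_embedding,
           w_gamma=[[2.0, 0.0], [0.0, 1.0]], b_gamma=[0.0, 0.0],
           w_beta=[[0.0, 0.0], [1.0, 0.0]], b_beta=[0.5, 0.0])
print(out)  # [2.5, 1.0]
```

In a real model the scale and shift projections would be learned layers applied to batched tensors; the list-based version only makes the arithmetic explicit.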

Abstract

Robot foundation models, particularly Vision-Language-Action (VLA) models, have garnered significant attention for their ability to enhance robot policy learning, greatly improving robots' generalization and robustness. OpenAI's recent model, O1, showcased impressive capabilities in solving complex problems by utilizing extensive reasoning chains. This prompts an important question: can robot models achieve better performance in multi-task, complex environments by reviewing prior observations and then providing task-specific reasoning to guide action prediction? In this paper, we introduce Chain-of-Affordance (CoA-VLA), a novel approach to scaling robot models by incorporating reasoning in the format of sequential robot affordances to facilitate task completion. Specifically, we prompt the model to consider the following four types of affordances before taking action: (1) object affordance: what object to manipulate and where it is; (2) grasp affordance: the specific object part to grasp; (3) spatial affordance: the optimal space to place the object; and (4) movement affordance: the collision-free path for movement. We further transform each affordance into two prompting formats: visual affordance and textual affordance. We introduce a novel vision-language co-injection module that integrates this knowledge into the policy network. This allows the robot to leverage essential contextual information during action inference, resulting in improved precision and robustness. Our experiments demonstrate that CoA-VLA outperforms state-of-the-art robot foundation models, including OpenVLA and Octo, on a variety of tasks. Furthermore, CoA-VLA exhibits strong generalization capabilities, including recognizing unseen object poses, identifying free space, and avoiding obstacles in novel environments.
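The four affordance types listed in the abstract can be pictured as one record that is rendered into a textual prompt. The sketch below is purely illustrative: the class name `AffordanceChain`, its field names, the pixel-coordinate conventions, and the output format are assumptions for this example and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # assumed (x_min, y_min, x_max, y_max) in pixels
Point = Tuple[int, int]          # assumed (x, y) in pixels


@dataclass
class AffordanceChain:
    """Hypothetical container for the four affordance types."""
    object_name: Optional[str] = None      # object affordance: what
    object_box: Optional[Box] = None       # object affordance: where
    grasp_point: Optional[Point] = None    # grasp affordance
    placement_box: Optional[Box] = None    # spatial affordance
    waypoints: Optional[List[Point]] = None  # movement affordance

    def to_text(self) -> str:
        """Render only the affordances that are set, in chain order."""
        parts = []
        if self.object_name is not None and self.object_box is not None:
            parts.append(f"object: {self.object_name} at {list(self.object_box)}")
        if self.grasp_point is not None:
            parts.append(f"grasp: {list(self.grasp_point)}")
        if self.placement_box is not None:
            parts.append(f"place: {list(self.placement_box)}")
        if self.waypoints:
            path = " -> ".join(str(list(p)) for p in self.waypoints)
            parts.append(f"move: {path}")
        return "; ".join(parts)


chain = AffordanceChain(object_name="teapot",
                        object_box=(10, 20, 50, 60),
                        grasp_point=(30, 40))
print(chain.to_text())  # object: teapot at [10, 20, 50, 60]; grasp: [30, 40]
```

Leaving a field unset simply omits it from the rendered text, which is one simple way to represent a step where not every affordance is needed.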

Paper Structure

This paper contains 20 sections, 8 figures, and 5 tables.

Figures (8)

  • Figure 1: This figure illustrates the overall framework of our CoA-VLA model, which empowers vision-language-action models with chain-of-thought reasoning capabilities for generalizable visuomotor policy learning. We achieve this by designing four distinct types of affordance and introducing a novel visual-text co-injection method to integrate this knowledge into the decision-making process.
  • Figure 2: An example of the chain-of-affordance for the PourTea task. The first row presents the text affordance and the second row shows the visual affordance. By employing a dynamic affordance selection mechanism, our method avoids generating redundant affordances at every timestep.
  • Figure 3: Robot setup and examples for real-world manipulation tasks. We evaluate seven real-world tasks on a Franka robot arm equipped with two external ZED cameras and a RealSense 435i wrist camera.
  • Figure 4: Spatial affordance for CoA-VLA. CoA-VLA can identify free space for object placement.
  • Figure 5: Movement generalization for CoA-VLA. CoA-VLA can avoid obstacles and operate safely.
  • ...and 3 more figures