IntentionVLA: Generalizable and Efficient Embodied Intention Reasoning for Human-Robot Interaction

Yandu Chen, Kefan Gu, Yuqing Wen, Yucheng Zhao, Tiancai Wang, Liqiang Nie

TL;DR

IntentionVLA tackles the gap between multimodal perception and embodied reasoning in Vision-Language-Action models by introducing a curriculum-based dataset and a two-stage training pipeline that first teaches embodied intention reasoning and spatial grounding, then distills these insights into compact cues to condition a diffusion-based action generator. The model achieves strong in-distribution and out-of-distribution generalization, including zero-shot human-robot interaction, and demonstrates real-time inference through efficient compact reasoning. Key innovations include three reasoning formats (intention, spatial grounding, compact reasoning), learnable queries bridging reasoning and action, and a diffusion-based controller with a lightweight training scheme. The work shows substantial improvements over state-of-the-art baselines across direct/intention instructions, novel object manipulation, and real-world HRI, with strong multimodal understanding performance and ablations confirming the value of each design choice.

Abstract

Vision-Language-Action (VLA) models leverage pretrained vision-language models (VLMs) to couple perception with robotic control, offering a promising path toward general-purpose embodied intelligence. However, current SOTA VLAs are primarily pretrained on multimodal tasks with limited relevance to embodied scenarios, and then finetuned to map explicit instructions to actions. Consequently, due to the lack of reasoning-intensive pretraining and reasoning-guided manipulation, these models are unable to perform implicit human intention reasoning required for complex, real-world interactions. To overcome these limitations, we propose IntentionVLA, a VLA framework with a curriculum training paradigm and an efficient inference mechanism. Our proposed method first leverages carefully designed reasoning data that combine intention inference, spatial grounding, and compact embodied reasoning, endowing the model with both reasoning and perception capabilities. In the following finetuning stage, IntentionVLA employs the compact reasoning outputs as contextual guidance for action generation, enabling fast inference under indirect instructions. Experimental results show that IntentionVLA substantially outperforms $\pi_0$, achieving 18% higher success rates with direct instructions and 28% higher than ECoT under intention instructions. On out-of-distribution intention tasks, IntentionVLA achieves over twice the success rate of all baselines, and further enables zero-shot human-robot interaction with 40% success rate. These results highlight IntentionVLA as a promising paradigm for next-generation human-robot interaction (HRI) systems.
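The inference flow described in the abstract (infer a compact reasoning cue from an indirect instruction, then use that cue as context for action generation) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: `Observation`, `run_step`, `toy_reasoner`, `toy_action_head`, the cue text "pick up the phone", and the chunk shape of 4 steps by 7 values. The toy action head returns zeros and does not implement diffusion; it only marks where a conditioned action generator would sit.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Observation:
    """Hypothetical container for one control step's inputs."""
    image: object        # placeholder for a camera frame
    instruction: str     # possibly indirect, e.g. a stated intention


# Hypothetical interfaces for the two stages (names are assumptions).
Reasoner = Callable[[Observation], str]
ActionHead = Callable[[Observation, str], List[List[float]]]


def run_step(obs: Observation,
             reasoner: Reasoner,
             action_head: ActionHead) -> Tuple[str, List[List[float]]]:
    """Produce a compact reasoning cue, then an action chunk conditioned on it."""
    compact = reasoner(obs)            # short cue instead of a long reasoning chain
    chunk = action_head(obs, compact)  # generator sees the cue as extra context
    return compact, chunk


def toy_reasoner(obs: Observation) -> str:
    """Stand-in for the VLM: a lookup table, purely for illustration."""
    table = {"I want to call my friend": "pick up the phone"}
    # Direct instructions pass through unchanged in this toy version.
    return table.get(obs.instruction, obs.instruction)


def toy_action_head(obs: Observation, compact: str,
                    horizon: int = 4, dof: int = 7) -> List[List[float]]:
    """Stand-in for the action generator: returns an all-zero chunk."""
    return [[0.0] * dof for _ in range(horizon)]


if __name__ == "__main__":
    obs = Observation(image=None, instruction="I want to call my friend")
    compact, chunk = run_step(obs, toy_reasoner, toy_action_head)
    print(compact, len(chunk), len(chunk[0]))
```

The point of the split is that only the short cue is passed to the action generator, which is what the abstract credits for fast inference under indirect instructions.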

Paper Structure

This paper contains 13 sections, 5 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Problems with existing VLAs. Given the instruction "I want to call my friend", ECoT (right) reaches for the phone but infers slowly, while $\pi_0$ (middle) misinterprets the instruction and grasps the rag. In contrast, our method (left) correctly infers the user's intention and enables rapid task completion.
  • Figure 2: Visualization of the annotation modules. We visualize the four fully automated modules that constitute the data pipeline. The intermediate outputs of the modules are integrated to form the final reasoning data.
  • Figure 3: Overview of our proposed reasoning data and efficient annotation pipeline. The pipeline consists of 4 decoupled modules that can run in parallel for high annotation efficiency. Intention and spatial reasoning chains are further compressed into compact short reasoning for fast inference.
  • Figure 4: Overview of the IntentionVLA framework. IntentionVLA achieves intention inference and reasoning-guided manipulation in one unified model. We first pretrain the VLM backbone with diverse intention reasoning data. Then we finetune the action module to decode action chunks that follow the compact reasoning output.
  • Figure 5: Real-world experiment setting
  • ...and 2 more figures