iFlyBot-VLM Technical Report

Xin Nie, Zhiyuan Cheng, Yuan Zhang, Chao Ji, Jiajia Wu, Yuhan Zhang, Jia Pan

TL;DR

iFlyBot-VLM is a vision-language foundation model tailored for embodied AI that translates rich environmental perception into an operational language robots can act upon. The approach combines a three-stage ViT-Projector-LLM architecture with a Dimension-Expanded Position Embedding to enhance spatial reasoning, and it is trained on a diverse suite of ~3.8M samples spanning spatial understanding, grounding, affordances, grasping, trajectories, and task planning. The work reports strong performance across pointing, grounding, trajectory fidelity, and planning tasks, with state-of-the-art results on benchmarks such as Where2Place, RefSpatial, and BLINK, and competitive results on EgoPlan and ERQA. By releasing both data and weights, it aims to accelerate research in embodied intelligence and support generalist, cognitively capable robotic agents across platforms.
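The three-stage ViT-Projector-LLM data flow named above can be illustrated with a toy sketch. This is not the authors' implementation: the dimensions, the random weights, and the helper names (`vision_encoder`, `projector`, `build_llm_input`) are all hypothetical, each stage is reduced to a single linear map, and the Dimension-Expanded Position Embedding is omitted because this summary does not specify it. The sketch only shows how visual features are projected into the language model's embedding width and prepended to the text embeddings.

```python
import random

def make_matrix(rows, cols, rng):
    """Random matrix as a list of rows (placeholder for learned weights)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    """Plain matrix product of two list-of-rows matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def vision_encoder(patches, w_vit):
    # Stage 1 (assumption): stand-in for the ViT, one linear map per
    # flattened image patch. A real ViT also applies attention layers.
    return matmul(patches, w_vit)

def projector(visual_feats, w_proj):
    # Stage 2 (assumption): map visual features to the LLM embedding width.
    return matmul(visual_feats, w_proj)

def build_llm_input(visual_tokens, text_embeds):
    # Stage 3 input (assumption): visual tokens are prepended to the
    # text embeddings, and the LLM would consume the joint sequence.
    return visual_tokens + text_embeds

rng = random.Random(0)
# Hypothetical toy sizes, chosen only to keep the example small.
NUM_PATCHES, PATCH_DIM, VIT_DIM, LLM_DIM, NUM_TEXT = 4, 12, 8, 16, 3

patches = make_matrix(NUM_PATCHES, PATCH_DIM, rng)    # flattened image patches
text_embeds = make_matrix(NUM_TEXT, LLM_DIM, rng)     # embedded prompt tokens
w_vit = make_matrix(PATCH_DIM, VIT_DIM, rng)
w_proj = make_matrix(VIT_DIM, LLM_DIM, rng)

visual_tokens = projector(vision_encoder(patches, w_vit), w_proj)
sequence = build_llm_input(visual_tokens, text_embeds)
print(len(sequence), len(sequence[0]))  # 7 16
```

The point of the projector stage is that the vision encoder and the language model generally use different feature widths, so some mapping is needed before the two token streams can share one sequence.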

Abstract

We introduce iFlyBot-VLM, a general-purpose Vision-Language Model (VLM) designed for the domain of Embodied Intelligence. The central objective of iFlyBot-VLM is to bridge the cross-modal semantic gap between high-dimensional environmental perception and low-level robotic motion control. To this end, the model abstracts complex visual and spatial information into a body-agnostic and transferable Operational Language, thereby enabling seamless closed-loop perception-action coordination across diverse robotic platforms. The architecture of iFlyBot-VLM is systematically designed to realize four key functional capabilities essential for embodied intelligence: 1) Spatial Understanding and Metric Reasoning; 2) Interactive Target Grounding; 3) Action Abstraction and Control Parameter Generation; 4) Task Planning and Skill Sequencing. We envision iFlyBot-VLM as a scalable and generalizable foundation model for embodied AI, facilitating the progression from specialized task-oriented systems toward generalist, cognitively capable agents. We evaluated the model on 10 mainstream VLM benchmarks related to embodied intelligence, such as BLINK and Where2Place, and it achieved leading performance while preserving its general capabilities. We will publicly release both the training data and model weights to foster further research and development in the field of Embodied Intelligence.

Paper Structure

This paper contains 36 sections, 3 equations, 14 figures, and 2 tables.

Figures (14)

  • Figure 1: Overview. The model supports spatial object pointing, 2D trajectories, affordance regions, 3D bounding boxes (3D box), object grasping poses, object counting, spatial relationship judgment, multi-image mapping, and task planning. It also retains strong general multimodal capabilities such as caption generation, grounding, and Optical Character Recognition (OCR), and it achieves state-of-the-art (SOTA) performance on multiple evaluation datasets.
  • Figure 2: The iFlyBot-VLM Model. iFlyBot-VLM follows the three-stage "ViT-Projector-LLM" paradigm of established Vision-Language Models.
  • Figure 3: Data distribution chart, covering 13 subcategories of data.
  • Figure 4: Training Data. The training data covers five categories: General Multimodal Understanding, Action Abstraction & Control Parameter Generation, Spatial Understanding, Interactive Target Grounding, and Task Planning.
  • Figure 5: Spatial Understanding Data. This figure presents partial examples of spatial understanding data, including visual correspondence, relative depth, counting, camera movement, spatial relation, and perspective.
  • ...and 9 more figures