iFlyBot-VLA Technical Report

Yuan Zhang, Chenyu Xue, Wenjie Xu, Chao Ji, Jiajia Wu, Jia Pan

TL;DR

iFlyBot-VLA introduces a latent-action–based VLA framework that blends implicit high-level planning with explicit low-level action tokens to align language, vision, and control. The approach uses a VQ-VAE latent action codebook and a FAST discrete token encoder to supervise a VLM and a flow-matching diffusion action expert, enabling continuous, precise manipulation. Training combines large-scale human/robot datasets with spatial QA data, preserving perception while enhancing 3D reasoning, and demonstrates strong generalization on LIBERO and real-world tasks. The work delivers state-of-the-art results, particularly in long-horizon and dexterous manipulation, and opens access to part of its dataset to foster community research.

Abstract

We introduce iFlyBot-VLA, a large-scale Vision-Language-Action (VLA) model trained under a novel framework. The main contributions are as follows: (1) a latent action model thoroughly trained on large-scale human and robotic manipulation videos; (2) a dual-level action representation framework that jointly supervises both the Vision-Language Model (VLM) and the action expert during training; (3) a mixed training strategy that combines robot trajectory data with general QA and spatial QA datasets, effectively enhancing the 3D perceptual and reasoning capabilities of the VLM backbone. Specifically, the VLM is trained to predict two complementary forms of actions: latent actions, derived from our latent action model pretrained on cross-embodiment manipulation data, which capture implicit high-level intentions; and structured discrete action tokens, obtained through frequency-domain transformations of continuous control signals, which encode explicit low-level dynamics. This dual supervision aligns the representation spaces of language, vision, and action, enabling the VLM to directly contribute to action generation. Experimental results on the LIBERO Franka benchmark demonstrate the superiority of our framework, while real-world evaluations further show that iFlyBot-VLA achieves competitive success rates across diverse and challenging manipulation tasks. Furthermore, we plan to open-source a portion of our self-constructed dataset to support future research in the community.
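
To make the "frequency-domain transformation" of continuous control signals more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single joint trajectory, an orthonormal DCT-II, and a hypothetical quantization `scale` of 10; the function names are illustrative only. Tokenizers of this family typically add a compression step (such as byte-pair encoding) on top of the quantized coefficients, which is omitted here.

```python
import math


def dct_ii(signal):
    """Orthonormal DCT-II of a 1-D sequence (pure Python, O(n^2))."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        total = sum(
            x * math.cos(math.pi * (i + 0.5) * k / n)
            for i, x in enumerate(signal)
        )
        norm = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(norm * total)
    return coeffs


def idct_ii(coeffs):
    """Inverse of the orthonormal DCT-II above."""
    n = len(coeffs)
    signal = []
    for i in range(n):
        total = coeffs[0] * math.sqrt(1.0 / n)
        total += sum(
            coeffs[k] * math.sqrt(2.0 / n)
            * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        signal.append(total)
    return signal


def tokenize(chunk, scale=10.0):
    """Map a continuous action chunk to integer tokens.

    `scale` is an assumed hyperparameter: larger values keep more detail.
    """
    return [round(c * scale) for c in dct_ii(chunk)]


def detokenize(tokens, scale=10.0):
    """Recover an approximate continuous chunk from integer tokens."""
    return idct_ii([t / scale for t in tokens])


# Hypothetical 8-step trajectory of one joint (arbitrary units).
chunk = [0.00, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35]
tokens = tokenize(chunk)
recon = detokenize(tokens)
max_err = max(abs(a - b) for a, b in zip(chunk, recon))
print(tokens, max_err)
```

For a smooth trajectory like this one, most of the signal energy lands in the first few coefficients, so many high-frequency tokens round to zero. That sparsity is what makes a frequency-domain representation compact, at the cost of a bounded reconstruction error set by `scale`.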

Paper Structure

This paper contains 19 sections, 5 equations, 12 figures.

Figures (12)

  • Figure 1: The architecture of iFlyBot-VLA consists primarily of a language transformer backbone and an action expert network. The model generates executable robot actions through a combination of explicit and implicit planning. The key-value (KV) cache from the VLM component is passed to the downstream action expert, while the FAST action tokens, which correspond to the explicit planning process, are not forwarded to the action expert.
  • Figure 2: Architecture of the latent action token encoding expert network.
  • Figure 3: The data used for training the latent action token encoding expert network.
  • Figure 4: Overview of our dataset. The pretraining mixture consists of subsets of OXE, AgiBot_World, self-collected manipulation data, and VQA data. The left figure shows the proportion of different datasets in the pretraining mixture, while the right figure illustrates the composition of QA datasets during the pretraining stage.
  • Figure 5: Task suites in the LIBERO dataset.
  • ...and 7 more figures