BLM$_1$: A Boundless Large Model for Cross-Space, Cross-Task, and Cross-Embodiment Learning

Wentao Tan, Bowen Wang, Heng Zhi, Chenyu Liu, Zhe Li, Jian Liu, Zengrong Lin, Yukun Dai, Yipeng Chen, Wenjie Yang, Enci Xie, Hao Xue, Baixu Ji, Chen Xu, Zhibin Wang, Tianshi Wang, Lei Zhu, Heng Tao Shen

TL;DR

BLM$_1$ addresses the need for a unified model that generalizes across digital and physical spaces, multiple tasks, and diverse embodiments. It introduces a two-stage training paradigm: Stage I injects embodied knowledge into a multimodal LLM, and Stage II trains a diffusion-based policy through an intent-bridging interface without fine-tuning the backbone. A single model achieves state-of-the-art performance across digital and physical benchmarks, outperforming MLLMs, ELLMs, VLAs, and GMLMs by roughly 6% on digital tasks and roughly 3% on physical tasks. Cross-embodiment generalization is demonstrated on a self-collected dataset spanning four robot embodiments and six tasks. The work provides a scalable framework for cross-space, cross-task, and cross-embodiment embodied intelligence, with potential impact on future generalist robots and multimodal agents.

Abstract

Multimodal large language models (MLLMs) have advanced vision-language reasoning and are increasingly deployed in embodied agents. However, significant limitations remain: MLLMs generalize poorly across digital-physical spaces and embodiments; vision-language-action models (VLAs) produce low-level actions yet lack robust high-level embodied reasoning; and most embodied large language models (ELLMs) are constrained to digital space, with poor generalization to the physical world. Thus, unified models that operate seamlessly across digital and physical spaces while generalizing across embodiments and tasks remain absent. We introduce the Boundless Large Model (BLM$_1$), a multimodal spatial foundation model that preserves instruction following and reasoning, incorporates embodied knowledge, and supports robust cross-embodiment control. BLM$_1$ integrates three key capabilities (cross-space transfer, cross-task learning, and cross-embodiment generalization) via a two-stage training paradigm. Stage I injects embodied knowledge into the MLLM through curated digital corpora while maintaining language competence. Stage II trains a policy module through an intent-bridging interface that extracts high-level semantics from the MLLM to guide control, without fine-tuning the MLLM backbone. This process is supported by a self-collected cross-embodiment demonstration suite spanning four robot embodiments and six progressively challenging tasks. Evaluations across digital and physical benchmarks show that a single BLM$_1$ instance outperforms four model families (MLLMs, ELLMs, VLAs, and GMLMs), achieving gains of approximately 6% on digital tasks and approximately 3% on physical tasks.
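To make the Stage II setup concrete, the following is a minimal sketch of training a policy head on top of a frozen backbone. It is not the authors' implementation. Everything in it is an assumption made for illustration: the sizes, the names (`extract_intent`, `init_policy`, `train_step`), the random projection standing in for the MLLM, the single linear layer standing in for the diffusion-based policy, and the plain noise-prediction loss, which is swapped in as a generic diffusion-style objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
OBS_DIM, INTENT_DIM, STATE_DIM, ACTION_DIM = 16, 8, 4, 3

# Stand-in for the frozen MLLM backbone: a fixed random projection.
W_BACKBONE = rng.normal(size=(OBS_DIM, INTENT_DIM))


def extract_intent(obs):
    """Intent-bridging stand-in: read high-level features from the frozen backbone."""
    return np.tanh(obs @ W_BACKBONE)


def init_policy():
    """Trainable head: one linear layer standing in for a diffusion policy."""
    in_dim = INTENT_DIM + STATE_DIM + ACTION_DIM + 1  # +1 for the noise level t
    return np.zeros((in_dim, ACTION_DIM))


def train_step(w_policy, obs, state, action, lr=0.05):
    """One gradient step on a noise-prediction loss; only w_policy is updated."""
    batch = obs.shape[0]
    t = rng.uniform(0.05, 0.95, size=(batch, 1))  # per-sample noise level
    noise = rng.normal(size=action.shape)
    noisy_action = np.sqrt(1.0 - t) * action + np.sqrt(t) * noise

    # Condition the head on intent features, robot state, noisy action, and t.
    x = np.concatenate([extract_intent(obs), state, noisy_action, t], axis=1)
    err = x @ w_policy - noise
    loss = float(np.mean(err ** 2))
    grad = 2.0 * x.T @ err / err.size  # gradient of the mean squared error
    return w_policy - lr * grad, loss


# Toy demonstrations (random placeholders, not real robot data).
obs = rng.normal(size=(64, OBS_DIM))
state = rng.normal(size=(64, STATE_DIM))
action = 0.1 * rng.normal(size=(64, ACTION_DIM))

backbone_before = W_BACKBONE.copy()
w_policy = init_policy()
losses = []
for _ in range(300):
    w_policy, loss = train_step(w_policy, obs, state, action)
    losses.append(loss)

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The point of the sketch is the separation of roles: the backbone weights are only read, never written, while the gradient step touches the policy head alone. A real system would replace the linear head with a learned denoising network and the random projection with features taken from the MLLM.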

Paper Structure

This paper contains 64 sections, 9 equations, 19 figures, 11 tables.

Figures (19)

  • Figure 1: BLM$_1$ is the first work to realize cross-space transfer, cross-task learning, and cross-embodiment generalization within a single multimodal spatial foundation model. Evaluations show BLM$_1$ achieves SOTA performance over MLLMs, ELLMs, VLAs, and GMLMs across digital and physical spaces.
  • Figure 2: The main framework of BLM$_1$. Multimodal inputs are first encoded and fused by a prompt engine, then passed to the MLLM backbone. BLM$_1$ follows a two-stage training paradigm. In Stage I, the model undergoes supervised fine-tuning on digital-space tasks to acquire embodied knowledge while preserving instruction-following capabilities. Stage II introduces an intent-bridging interface that connects the MLLM to a Diffusion Transformer policy head. This stage is trained using robot states, noisy actions, and a future-prediction loss. The result is a single unified model capable of handling both digital and physical tasks, enabling three boundless capabilities: cross-space transfer, cross-task learning, and cross-embodiment generalization.
  • Figure 3: Example comparison of results on multiple-choice questions.
  • Figure 4: Example comparison of results on free-form QA.
  • Figure 5: Cross-embodiment data collection pipeline for Stage II training.
  • ...and 14 more figures