HALO: A Unified Vision-Language-Action Model for Embodied Multimodal Chain-of-Thought Reasoning

Quanxin Shou; Fangqi Zhu; Shawn Chen; Puxin Yan; Zhengyang Yan; Yikun Miao; Xiaoyi Pang; Zicong Hong; Ruikai Shi; Hao Huang; Jie Zhang; Song Guo

TL;DR

HALO is a unified VLA model that enables embodied multimodal chain-of-thought (EM-CoT) reasoning through a sequential process of textual task reasoning, visual subgoal prediction for fine-grained guidance, and EM-CoT-augmented action prediction.

Abstract

Vision-Language-Action (VLA) models have shown strong performance in robotic manipulation, but often struggle in long-horizon or out-of-distribution scenarios because they lack explicit mechanisms for multimodal reasoning and for anticipating how the world will evolve under action. Recent works add textual chain-of-thought or visual subgoal prediction to VLA models, but still fail to offer a unified, human-like reasoning framework for joint textual reasoning, visual foresight, and action prediction. To this end, we propose HALO, a unified VLA model that enables embodied multimodal chain-of-thought (EM-CoT) reasoning through a sequential process of textual task reasoning, visual subgoal prediction for fine-grained guidance, and EM-CoT-augmented action prediction. We instantiate HALO with a Mixture-of-Transformers (MoT) architecture that decouples semantic reasoning, visual foresight, and action prediction into specialized experts while allowing seamless cross-expert collaboration. To train HALO at scale, we introduce an automated pipeline to synthesize EM-CoT training data along with a carefully crafted training recipe. Extensive experiments demonstrate that: (1) HALO achieves superior performance in both simulated and real-world environments, surpassing the baseline policy π_0 by 34.1% on the RoboTwin benchmark; (2) all proposed components of the training recipe and EM-CoT design help improve task success rate; and (3) HALO exhibits strong generalization capabilities under aggressive unseen environmental randomization with our proposed EM-CoT reasoning.
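
The sequential EM-CoT process described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `em_cot_step` and the three toy callables are hypothetical names, and the toy functions merely stand in for the understanding, visual generation, and action experts. The sketch shows only the ordering of the three stages and how each stage conditions on the outputs before it, not the actual HALO model.

```python
def em_cot_step(obs, instruction, reason, imagine, act):
    """Run one EM-CoT step: textual reasoning, then visual subgoal, then actions."""
    thought = reason(obs, instruction)                 # textual task reasoning
    subgoal = imagine(obs, instruction, thought)       # visual subgoal prediction
    actions = act(obs, instruction, thought, subgoal)  # EM-CoT-augmented actions
    return thought, subgoal, actions


# Hypothetical toy stand-ins for the three experts (not the real networks).
calls = []


def toy_reason(obs, instruction):
    calls.append("reason")
    return f"plan for: {instruction}"


def toy_imagine(obs, instruction, thought):
    calls.append("imagine")
    return [x + 1 for x in obs]  # pretend the subgoal is a shifted observation


def toy_act(obs, instruction, thought, subgoal):
    calls.append("act")
    return [g - o for g, o in zip(subgoal, obs)]  # move toward the subgoal


thought, subgoal, actions = em_cot_step(
    [0, 2], "stack the blocks", toy_reason, toy_imagine, toy_act
)
```

Passing the experts in as callables keeps the ordering explicit; in a unified model the three stages would share one network rather than being separate functions.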

Paper Structure (30 sections, 3 equations, 8 figures, 5 tables, 2 algorithms)

Introduction
Related Work
Vision-Language-Action Models.
Textual and Visual Reasoning for Robotic Intelligence.
Method
Problem Formulation
Unified Architecture
EM-CoT Data Pipeline
Training Recipe
Stage 1: Versatile Pre-training.
Stage 2: EM-CoT-Augmented Fine-tuning.
Experiments
Experiment Settings
Simulation Results
Ablation Study
...and 15 more sections

Figures (8)

Figure 1: HALO first performs textual reasoning and task planning, then predicts visual subgoals for fine-grained guidance, and finally generates actions conditioned on EM-CoT. This process is implemented using a Mixture-of-Transformers (MoT) architecture that integrates a multimodal understanding expert, a visual generation expert, and an action prediction expert through shared self-attention.
Figure 2: Overview of EM-CoT Data Synthesis Pipeline. The pipeline converts raw robotic trajectories into EM-CoT data in three phases: (1) action primitives are extracted from robot proprioception via rule-based matching; (2) a VLM acts as an annotator to generate task plans, decompose trajectories into subtasks, and align each subtask with explicit textual reasoning; and (3) the terminal frame of each subtask is selected as a visual subgoal image, producing structured embodied multimodal chain-of-thought supervision.
Figure 3: Attention Masking Strategy for EM-CoT. (1) Spatial and semantic tokens utilize bidirectional attention within frames. (2) Noise tokens attend bidirectionally to each other. (3) Cross-modality interactions and text generation follow causal constraints. (4) Non-noise tokens are masked from noise tokens to prevent leakage.
Figure 4: Overview of dataset recipe. HALO training involves two stages: Stage 1 pre-trains on general VQA, visual generation, and action prediction to build foundation skills; Stage 2 introduces EM-CoT-augmented fine-tuning to inject multi-step reasoning and foresight capabilities.
Figure 5: Qualitative Analysis of the EM-CoT. We highlight (i) accurate textual reasoning and subgoal image generation in the clean setting, and (ii) robust EM-CoT generalization under the unseen domain-randomized settings.
...and 3 more figures
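
The four masking rules in the Figure 3 caption can be sketched as a boolean attention mask. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `build_mask`, the segment kinds `"text"`, `"image"`, and `"noise"`, and the reading of rule (4) as "non-noise queries may not attend to noise keys" are all assumptions. `mask[q][k]` is `True` when query token `q` may attend to key token `k`.

```python
def build_mask(segments):
    """segments: list of (kind, length) with kind in {"text", "image", "noise"}."""
    # Expand segments into one (segment index, kind) label per token.
    labels = [(s, kind) for s, (kind, n) in enumerate(segments) for _ in range(n)]
    size = len(labels)
    mask = [[False] * size for _ in range(size)]
    for q, (q_seg, q_kind) in enumerate(labels):
        for k, (k_seg, k_kind) in enumerate(labels):
            if q_seg == k_seg:
                # Rules 1-2: image and noise segments are bidirectional;
                # text within a segment is causal.
                mask[q][k] = q_kind in ("image", "noise") or k <= q
            elif k_seg < q_seg:
                # Rule 3: causal across segments.
                # Rule 4 (assumed reading): noise keys are hidden from non-noise queries.
                mask[q][k] = not (k_kind == "noise" and q_kind != "noise")
    return mask


# Token layout: 0-1 text, 2-3 image, 4-5 noise, 6 text.
mask = build_mask([("text", 2), ("image", 2), ("noise", 2), ("text", 1)])
```

In this layout the final text token (index 6) can attend to the earlier text and image tokens but not to the noise tokens at indices 4 and 5, while the two noise tokens attend to each other and to everything before them.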
