BiTAgent: A Task-Aware Modular Framework for Bidirectional Coupling between Multimodal Large Language Models and World Models

Yu-Wei Zhan, Xin Wang, Pengzhe Mao, Tongtong Feng, Ren Wang, Wenwu Zhu

TL;DR

BiTAgent tackles open-ended embodied intelligence by coupling Multimodal Large Language Models with World Models in a bidirectional framework. It introduces a Task-Aware Dynamic Joint Learning architecture with Task-Aware Modular Fusion and a task-conditioned reward mechanism, enabling semantic guidance of dynamics and feedback-driven semantic refinement. The approach yields a unified optimization objective and demonstrates superior multi-task and cross-environment generalization on the DeepMind Control Suite compared to state-of-the-art baselines. This work advances open-ended embodied decision-making through tight semantic-dynamics integration.

Abstract

Building generalist embodied agents requires a unified system that can interpret multimodal goals, model environment dynamics, and execute reliable actions across diverse real-world tasks. Multimodal large language models (MLLMs) offer strong semantic priors and cross-modal generalization, while world models (WMs) provide actionable latent dynamics for prediction and control. Their combination holds promise for open-ended embodied intelligence, yet introduces two key challenges: (1) establishing a tight coupling between the semantic intent from MLLMs and the dynamic state representations within the WM's latent space, and (2) achieving task-aware adaptability that supports multi-task learning and cross-environment generalization. To address these challenges, we propose BiTAgent, a task-aware dynamic joint framework that enables bidirectional coupling between MLLMs and WMs. BiTAgent establishes two complementary pathways: a forward path that injects MLLM representations into the WM's latent space for semantically guided imagination, and a backward path where WM-generated feedback refines the MLLM's semantic space via dense text-conditioned rewards. This bidirectional interaction is realized through three synergistic components: Task-Aware Dynamic Joint Learning, Task-Aware Behavior Learning, and MLLM-WM Joint Optimization, which together harmonize semantic reasoning and dynamic prediction. Extensive experiments across multi-task and cross-environment settings demonstrate superior stability and generalization over state-of-the-art baselines, marking a step toward open-ended embodied learning.

Paper Structure

This paper contains 27 sections, 14 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: BiTAgent enables task-aware bidirectional coupling between the MLLM and the World Model. In the forward path, MLLM semantics are injected into the WM via Task-Aware Modular Fusion for semantically guided imagination. In the backward path, task-conditioned imagined trajectories produce rewards and actions, which are backpropagated through the joint loss to refine the MLLM.
  • Figure 2: Overview of the proposed BiTAgent framework. It establishes bidirectional coupling between the MLLM and the World Model. In the forward path, semantic representations produced by the MLLM are injected into the WM’s latent space via the Task-Aware Modular Fusion mechanism, enabling semantically guided imagination rather than purely physics-driven rollouts. In the backward path, Task-Aware Behavior Learning leverages WM-generated imagined trajectories to compute dense text-aligned rewards, which provide gradient feedback that reshapes the MLLM’s semantic space.
  • Figure 3: Cross-environment generalization of agents trained in the Walker domain. The top row shows evaluation results in the Quadruped environment, while the bottom row presents results in the Stickman environment.
  • Figure 4: Visualization of task-conditioned imagined trajectories. For each environment, the top row shows real observations and the bottom row shows reconstructions decoded from the task-conditioned imagination trajectories.
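
The captions above describe dense text-aligned rewards computed over imagined trajectories. As a rough illustration only, the toy sketch below scores each imagined latent state by its cosine similarity to a task embedding. The choice of cosine similarity, the function names `cosine` and `text_aligned_rewards`, and the two-dimensional vectors are all assumptions made for this example; this summary does not state how BiTAgent actually computes its rewards.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def text_aligned_rewards(task_embedding, imagined_latents):
    """Hypothetical dense reward: one similarity score per imagined step.

    Assumption: the task text and the imagined latent states live in a
    shared embedding space, so similarity can serve as a per-step reward.
    """
    return [cosine(task_embedding, z) for z in imagined_latents]


# Toy data: a 2-D "task" embedding and a three-step imagined trajectory.
task = [1.0, 0.0]
trajectory = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
rewards = text_aligned_rewards(task, trajectory)
```

In this toy setup every imagined step receives a reward, which is what makes the signal dense: a step aligned with the task embedding scores highest, an orthogonal one scores zero, and an opposed one scores lowest.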