ABot-PhysWorld: Interactive World Foundation Model for Robotic Manipulation with Physics Alignment

Yuzhi Chen, Ronghan Chen, Dongjie Huo, Yandan Yang, Dekang Qi, Haoyun Liu, Tong Lin, Shuang Zeng, Junjin Xiao, Xinyuan Chang, Feng Xiong, Xing Wei, Zhiheng Ma, Mu Xu

Abstract

Video-based world models offer a powerful paradigm for embodied simulation and planning, yet state-of-the-art models often generate physically implausible manipulations - such as object penetration and anti-gravity motion - due to training on generic visual data and likelihood-based objectives that ignore physical laws. We present ABot-PhysWorld, a 14B Diffusion Transformer model that generates visually realistic, physically plausible, and action-controllable videos. Built on a curated dataset of three million manipulation clips with physics-aware annotation, it uses a novel DPO-based post-training framework with decoupled discriminators to suppress unphysical behaviors while preserving visual quality. A parallel context block enables precise spatial action injection for cross-embodiment control. To better evaluate generalization, we introduce EZSbench, the first training-independent embodied zero-shot benchmark combining real and synthetic unseen robot-task-scene combinations. It employs a decoupled protocol to separately assess physical realism and action alignment. ABot-PhysWorld achieves new state-of-the-art performance on PBench and EZSbench, surpassing Veo 3.1 and Sora v2 Pro in physical plausibility and trajectory consistency. We will release EZSbench to promote standardized evaluation in embodied video generation.

Paper Structure (27 sections, 2 equations, 16 figures, 3 tables)

Figures (16)

  • Figure 1: Overview of the data curation pipeline. (a) Multi-stage filtering and balancing flow from raw aggregation ($\sim$3M clips) to training-ready splits (SFT, RL, and A2V data). (b) Task-aware quota allocation: head tasks are capped at 8--15%, body tasks are uniformly sampled at 40--50%, and long-tail tasks are fully preserved to maximize task diversity. (c) Dataset and robot type distribution: the left ring shows the original composition and the right ring shows the rebalanced result after hierarchical sampling. (d) Physics-aware video captioning pipeline: a perception module (Qwen3-VL 32B) extracts structured physical attributes, followed by a writing module (Qwen3 32B FP8) that generates four-phase captions covering scene setup, action detail, state transition, and camera summary.
  • Figure 2: Two-stage training pipeline. Stage 1: SFT on the DiT to predict future frames from observations and instructions. Stage 2: generate $N$ candidates, score them with a physics checklist, and apply DPO via LoRA on frozen DiT weights.
  • Figure 3: Construction pipeline of EZSbench. Top: dual-source image augmentation. Branch 1 generates synthetic initial observations via text-to-image (Nano Banana) by varying robot morphology, scene, task, and viewpoint; Branch 2 applies VLM-guided background editing to real-world images while preserving foreground interactions. Bottom: three-stage dense description synthesis. Visual anchoring grounds the scene layout and object coordinates, action simulation infers kinematically compliant trajectories with micro-physical interactions, and narrative synthesis produces a documentary-style caption integrating initial state, trajectory, and final state.
  • Figure 4: Architecture of the action-conditioned video generation model. We selectively duplicate DiT blocks as parallel context blocks to process action maps, and fuse their outputs residually into the main DiT.
  • Figure 5: Qualitative comparison on PAI-Bench.
  • ...and 11 more figures
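
The second stage described in the Figure 2 caption (generate $N$ candidates, score them, apply DPO) can be illustrated with a minimal sketch. It assumes each candidate already has a scalar physics-checklist score and that scalar log-probabilities are available for the policy and a frozen reference model. The function names, the `beta` value, and the use of plain log-probabilities are illustrative assumptions; this excerpt does not specify the paper's actual objective, and a diffusion model would likely use a denoising-error variant rather than the generic DPO loss shown here.

```python
import math
from typing import List, Tuple


def select_preference_pair(scores: List[float]) -> Tuple[int, int]:
    """Pick (winner, loser) indices from per-candidate checklist scores.

    Hypothetical helper: the highest-scoring candidate is treated as the
    preferred sample and the lowest-scoring one as the rejected sample.
    """
    if len(scores) < 2:
        raise ValueError("need at least two candidates")
    winner = max(range(len(scores)), key=scores.__getitem__)
    loser = min(range(len(scores)), key=scores.__getitem__)
    if scores[winner] == scores[loser]:
        raise ValueError("all candidates tie; no preference signal")
    return winner, loser


def dpo_loss(
    logp_w: float,
    logp_l: float,
    ref_logp_w: float,
    ref_logp_l: float,
    beta: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """Generic DPO loss for one preference pair: -log(sigmoid(margin)).

    margin = beta * [(log pi(w) - log pi_ref(w)) - (log pi(l) - log pi_ref(l))]
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    checklist_scores = [0.4, 0.9, 0.1, 0.7]  # made-up scores for N = 4 candidates
    w, l = select_preference_pair(checklist_scores)
    print("winner:", w, "loser:", l)
    print("loss at initialization:", dpo_loss(-10.0, -10.0, -10.0, -10.0))
```

When the policy equals the reference model the margin is zero and the loss is $\log 2$; it decreases as the policy raises the likelihood of the preferred sample relative to the rejected one.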