EmboAlign: Aligning Video Generation with Compositional Constraints for Zero-Shot Manipulation

Gehao Zhang, Zhenyang Ni, Payal Mohapatra, Han Liu, Ruohan Zhang, Qi Zhu

TL;DR

A data-free framework that aligns video generative model (VGM) outputs with compositional constraints generated by vision-language models (VLMs) at inference time. The key insight is that VLMs offer a capability complementary to VGMs: structured spatial reasoning that can identify the physical constraints critical to the success and safety of manipulation execution.

Abstract

Video generative models (VGMs) pretrained on large-scale internet data can produce temporally coherent rollout videos that capture rich object dynamics, offering a compelling foundation for zero-shot robotic manipulation. However, VGMs often produce physically implausible rollouts, and converting their pixel-space motion into robot actions through geometric retargeting further introduces cumulative errors from imperfect depth estimation and keypoint tracking. To address these challenges, we present EmboAlign, a data-free framework that aligns VGM outputs with compositional constraints generated by vision-language models (VLMs) at inference time. The key insight is that VLMs offer a capability complementary to VGMs: structured spatial reasoning that can identify the physical constraints critical to the success and safety of manipulation execution. Given a language instruction, EmboAlign uses a VLM to automatically extract a set of compositional constraints capturing task-specific requirements, which are then applied at two stages: (1) constraint-guided rollout selection, which scores and filters a batch of VGM rollouts to retain the most physically plausible candidate, and (2) constraint-based trajectory optimization, which uses the selected rollout as initialization and refines the robot trajectory under the same constraint set to correct retargeting errors. We evaluate EmboAlign on six real-robot manipulation tasks requiring precise, constraint-sensitive execution, improving the overall success rate by 43.3 percentage points over the strongest baseline without any task-specific training data.
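Stage (1), constraint-guided rollout selection, can be illustrated with a minimal Python sketch. It assumes, following the Figure 2 caption, that candidates are ranked by a plausibility score and then checked against the constraint set in descending-score order. The `Rollout` type, its `plausibility` and `features` fields, the thresholds, and the function name `select_rollout` are hypothetical names chosen for illustration, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

@dataclass
class Rollout:
    """Hypothetical container for one candidate VGM rollout."""
    name: str
    plausibility: float            # assumed score from a ranking model
    features: Dict[str, float]     # assumed measurements extracted from the video

# A constraint is modeled here as a boolean check on a rollout (an assumption).
Constraint = Callable[[Rollout], bool]

def select_rollout(rollouts: List[Rollout],
                   constraints: List[Constraint]) -> Optional[Rollout]:
    """Return the highest-scoring rollout that satisfies every constraint."""
    ranked = sorted(rollouts, key=lambda r: r.plausibility, reverse=True)
    for rollout in ranked:
        if all(check(rollout) for check in constraints):
            return rollout
    return None  # no candidate passed; a caller might resample

# Illustrative data only: offsets in meters, angles in degrees.
rollouts = [
    Rollout("A", 0.9, {"placement_offset": 0.050, "approach_angle": 10.0}),
    Rollout("B", 0.8, {"placement_offset": 0.005, "approach_angle": 40.0}),
    Rollout("C", 0.6, {"placement_offset": 0.004, "approach_angle": 5.0}),
]
constraints = [
    lambda r: r.features["placement_offset"] < 0.01,  # c1: placement alignment
    lambda r: r.features["approach_angle"] < 15.0,    # c2: top-down approach
]
best = select_rollout(rollouts, constraints)
```

In this toy batch the two highest-scoring candidates each violate one constraint, so the lower-scoring rollout `C` is retained. This is the intended effect of filtering: a plausibility score alone does not guarantee that task-specific requirements hold.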

Paper Structure (22 sections, 9 equations, 6 figures, 3 tables)


Figures (6)

  • Figure 1: Video Generation Models can zero-shot generate rich motion priors for manipulation tasks, but hallucinations and retargeting errors may prevent these from translating into correct robot actions. We propose to use VLM-derived compositional constraints (e.g., $c_1$: placement alignment, $c_2$: top-down approach) to align VGM outputs at both the video selection and trajectory optimization stages, bridging the gap between generative motion diversity and the physical precision that real-world manipulation demands.
  • Figure 2: EmboAlign pipeline. Given a language instruction and RGB-D observations, a VLM generates compositional constraints while a VGM produces candidate rollout videos. A latent world model ranks rollouts by physical plausibility, then the constraint set filters candidates in descending-score order. The top valid rollout is retargeted into an end-effector trajectory and optimized under the same constraints for real-world execution.
  • Figure 3: Optimization constraints for real-robot evaluation. For each of the six manipulation tasks, a VLM automatically generates a set of constraints encoding spatial, kinematic, and safety requirements. These constraints serve as optimization objectives for trajectory refinement during execution.
  • Figure 4: Task examples. Example scenes for the real-robot evaluation tasks.
  • Figure 5: Constraint-based video selection. For the stack task (place the green block on top of the red block), task constraints filter candidate VGM rollouts by rejecting invalid behaviors. We show representative rejected rollouts and an example candidate that passes all constraints and is selected for downstream retargeting and execution.
  • ...and 1 more figures
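
The trajectory refinement described in the Figure 2 and Figure 3 captions can be sketched as a soft-penalty optimization: start from the retargeted trajectory and descend on a cost that combines closeness to that initialization with constraint violations. Everything below is an assumption made for illustration: the quadratic cost terms, the weights `w_prior` and `w_con`, the finite-difference gradient descent, and the names `placement_cost`, `top_down_cost`, and `refine`. The paper's actual optimizer and constraint encoding may differ.

```python
from typing import Callable, List, Sequence

Trajectory = List[List[float]]  # end-effector waypoints as [x, y, z]

def placement_cost(traj: Trajectory, target_xy: Sequence[float]) -> float:
    """c1 (assumed form): final waypoint should sit above the target in xy."""
    x, y, _ = traj[-1]
    return (x - target_xy[0]) ** 2 + (y - target_xy[1]) ** 2

def top_down_cost(traj: Trajectory, last_k: int = 3) -> float:
    """c2 (assumed form): the last waypoints should share the final xy."""
    fx, fy, _ = traj[-1]
    return sum((x - fx) ** 2 + (y - fy) ** 2 for x, y, _ in traj[-last_k:-1])

def total_cost(traj: Trajectory, init: Trajectory,
               costs: List[Callable[[Trajectory], float]],
               w_prior: float = 0.1, w_con: float = 1.0) -> float:
    """Stay near the initialization while penalizing constraint violations."""
    prior = sum((a - b) ** 2 for p, q in zip(traj, init) for a, b in zip(p, q))
    return w_prior * prior + w_con * sum(c(traj) for c in costs)

def refine(init: Trajectory, costs: List[Callable[[Trajectory], float]],
           steps: int = 300, lr: float = 0.1, eps: float = 1e-5) -> Trajectory:
    """Gradient descent with central finite differences (illustrative only)."""
    traj = [list(p) for p in init]
    for _ in range(steps):
        grad = [[0.0] * 3 for _ in traj]
        for i in range(len(traj)):
            for j in range(3):
                orig = traj[i][j]
                traj[i][j] = orig + eps
                up = total_cost(traj, init, costs)
                traj[i][j] = orig - eps
                down = total_cost(traj, init, costs)
                traj[i][j] = orig
                grad[i][j] = (up - down) / (2 * eps)
        for i in range(len(traj)):
            for j in range(3):
                traj[i][j] -= lr * grad[i][j]
    return traj

# Illustrative retargeted trajectory (meters) and placement target.
init_traj = [[0.0, 0.0, 0.30], [0.10, 0.05, 0.25],
             [0.20, 0.08, 0.15], [0.27, 0.12, 0.05]]
target = (0.30, 0.10)
costs = [lambda t: placement_cost(t, target), top_down_cost]
refined = refine(init_traj, costs)
```

Because the constraints enter as weighted penalties, the refined trajectory reduces both violations without driving them to zero; the prior term keeps it close to the selected rollout's motion. A hard-constrained solver would be a reasonable alternative design if exact satisfaction were required.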