DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers

Lizhen Wang, Zhurong Xia, Tianshu Hu, Pengrui Wang, Pengfei Wei, Zerong Zheng, Ming Zhou, Yuan Zhang, Mingyuan Gao

TL;DR

DreamActor-H1 tackles high-fidelity human–product video generation by integrating a Diffusion Transformer with appearance fusion, motion guidance from 3D body templates and product boxes, and semantic text cues. The method preserves both human and product identities, ensures natural hand–product alignment, and maintains material fidelity through masked cross-attention and structured text embeddings. A large hybrid dataset and a robust training regime underpin performance, demonstrated via quantitative metrics and user studies that favor DreamActor-H1 over prior work. This approach promises practical impact for personalized e-commerce content and interactive media by delivering more realistic and controllable human–product demonstrations.

Abstract

In e-commerce and digital marketing, generating high-fidelity human-product demonstration videos is important for effective product presentation. However, most existing frameworks either fail to preserve the identities of both humans and products or lack an understanding of human-product spatial relationships, leading to unrealistic representations and unnatural interactions. To address these challenges, we propose a Diffusion Transformer (DiT)-based framework. Our method simultaneously preserves human identities and product-specific details, such as logos and textures, by injecting paired human-product reference information and utilizing an additional masked cross-attention mechanism. We employ a 3D body mesh template and product bounding boxes to provide precise motion guidance, enabling intuitive alignment of hand gestures with product placements. Additionally, structured text encoding is used to incorporate category-level semantics, enhancing 3D consistency during small rotational changes across frames. Trained on a hybrid dataset with extensive data augmentation strategies, our approach outperforms state-of-the-art techniques in maintaining the identity integrity of both humans and products and generating realistic demonstration motions. Project page: https://lizhenwangt.github.io/DreamActor-H1/.
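The abstract names a masked cross-attention mechanism for injecting product details but does not spell out its form here. The sketch below is one plausible reading, not the authors' implementation: video tokens attend to product reference tokens, and only tokens flagged by a product-region mask receive the attended features. The function name `masked_cross_attention`, the residual update, and the single-head NumPy formulation are all assumptions made for illustration.

```python
import numpy as np

def masked_cross_attention(queries, keys, values, mask):
    """Illustrative single-head masked cross-attention (hypothetical form).

    queries: (Nq, d) video tokens
    keys, values: (Nk, d) product reference tokens
    mask: (Nq,) boolean, True where a video token lies in the product region
    """
    d = queries.shape[-1]
    # Scaled dot-product scores between video tokens and product tokens.
    scores = queries @ keys.T / np.sqrt(d)
    # Numerically stable softmax over the product tokens.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    attended = weights @ values
    # Assumption: product features are added residually, and only
    # to tokens inside the product region; other tokens pass through.
    return queries + mask[:, None] * attended

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))   # 4 video tokens
k = rng.normal(size=(3, 8))   # 3 product tokens
v = rng.normal(size=(3, 8))
region = np.array([True, False, False, True])
out = masked_cross_attention(q, k, v, region)
```

In this sketch, tokens outside the mask are returned unchanged, which is one way product appearance could be kept from leaking into the human or background regions.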

Paper Structure

This paper contains 16 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: DreamActor-H1 can generate high-fidelity and photo-realistic human-product demonstration videos from human and product reference images.
  • Figure 2: The pipeline of DreamActor-H1 leverages a DiT architecture, starting with dataset preparation where a VLM describes product and human images, followed by pose estimation and bounding box detection on training videos. During training, human poses and product boxes integrate with video noise for motion guidance, while a VAE encodes input images for appearance guidance; human-product descriptions are fed into the model via a text encoder. The model incorporates full attention, reference attention, and object attention (with product latents as inputs), with the reference and object attention mechanisms detailed at the top of the figure.
  • Figure 3: During inference, our framework retrieves optimal motion templates from pre-defined pools and adapts object box scaling via joint analysis of reference human/product images, enabling pose-coherent animations.
  • Figure 4: Comparisons with AnchorCrafter [xu2024anchorcrafter], Phantom [liu2025phantom], VACE [vace], and UniAnimate-DiT※ [wang2024unianimate]. Note that we generate only 3 videos for AnchorCrafter, and UniAnimate-DiT uses our first frames and pose sequences as inputs.
  • Figure 5: Ablation studies with "Ours baseline" (w/o object attention and text input) and "Ours w/o text".
  • ...and 2 more figures
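
The Figure 2 caption says that product boxes are integrated with the video noise as motion guidance, without stating how. As a rough sketch under stated assumptions, the per-frame boxes could be rasterized into binary masks and concatenated to the noise along the channel axis. The helper names `boxes_to_mask` and `add_motion_guidance`, the pixel-coordinate box format `(x0, y0, x1, y1)`, and the choice of channel concatenation are hypothetical and not taken from the paper.

```python
import numpy as np

def boxes_to_mask(boxes, height, width):
    """Rasterize per-frame boxes (x0, y0, x1, y1) into a (T, H, W) binary mask."""
    masks = np.zeros((len(boxes), height, width), dtype=np.float32)
    for t, (x0, y0, x1, y1) in enumerate(boxes):
        # Clamp box corners to the frame so slicing stays in bounds.
        x0, x1 = np.clip([x0, x1], 0, width)
        y0, y1 = np.clip([y0, y1], 0, height)
        masks[t, int(y0):int(y1), int(x0):int(x1)] = 1.0
    return masks

def add_motion_guidance(noise, box_masks):
    """Assumption: guidance is appended as one extra channel.

    noise: (T, C, H, W); box_masks: (T, H, W); returns (T, C + 1, H, W).
    """
    return np.concatenate([noise, box_masks[:, None]], axis=1)

boxes = [(1, 1, 3, 4), (0, 0, 2, 2)]          # one box per frame
box_masks = boxes_to_mask(boxes, height=4, width=5)
noise = np.random.default_rng(0).normal(size=(2, 4, 4, 5)).astype(np.float32)
guided = add_motion_guidance(noise, box_masks)
```

A rendered pose map from the 3D body template could be appended in the same way; whether the actual model concatenates, adds, or encodes these signals differently is not specified in this summary.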