DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers

Lizhen Wang; Zhurong Xia; Tianshu Hu; Pengrui Wang; Pengfei Wei; Zerong Zheng; Ming Zhou; Yuan Zhang; Mingyuan Gao

TL;DR

DreamActor-H1 generates high-fidelity human-product demonstration videos with a Diffusion Transformer (DiT) that combines three inputs: fused human and product appearance references, motion guidance from a 3D body template and product bounding boxes, and semantic text cues. The method preserves both human and product identities, aligns hand gestures naturally with product placement, and maintains material fidelity through masked cross-attention and structured text embeddings. It is trained on a large hybrid dataset, and both quantitative metrics and user studies favor DreamActor-H1 over prior work. The intended applications are personalized e-commerce content and interactive media, where more realistic and controllable human-product demonstrations are needed.

Abstract

In e-commerce and digital marketing, generating high-fidelity human-product demonstration videos is important for effective product presentation. However, most existing frameworks either fail to preserve the identities of both humans and products or lack an understanding of human-product spatial relationships, leading to unrealistic representations and unnatural interactions. To address these challenges, we propose a Diffusion Transformer (DiT)-based framework. Our method simultaneously preserves human identities and product-specific details, such as logos and textures, by injecting paired human-product reference information and utilizing an additional masked cross-attention mechanism. We employ a 3D body mesh template and product bounding boxes to provide precise motion guidance, enabling intuitive alignment of hand gestures with product placements. Additionally, structured text encoding is used to incorporate category-level semantics, enhancing 3D consistency during small rotational changes across frames. Trained on a hybrid dataset with extensive data augmentation strategies, our approach outperforms state-of-the-art techniques in maintaining the identity integrity of both humans and products and generating realistic demonstration motions. Project page: https://lizhenwangt.github.io/DreamActor-H1/.
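The masked cross-attention mentioned above can be illustrated with a small sketch. The abstract does not say how the mask is built or where it is applied, so everything here is an assumption for illustration: the function names `box_to_token_mask` and `masked_cross_attention` are hypothetical, the mask is derived from a normalized product bounding box, and it gates which video tokens receive an update from the product reference tokens. It is not the authors' implementation.

```python
import numpy as np

def box_to_token_mask(box, grid_h, grid_w):
    """Flat boolean mask over a grid_h x grid_w token grid.

    box is (x0, y0, x1, y1) in normalized [0, 1] coordinates (an assumed
    convention). A token is selected when its cell center lies inside the box.
    """
    x0, y0, x1, y1 = box
    ys = (np.arange(grid_h) + 0.5) / grid_h  # cell centers along height
    xs = (np.arange(grid_w) + 0.5) / grid_w  # cell centers along width
    inside_y = (ys >= y0) & (ys <= y1)
    inside_x = (xs >= x0) & (xs <= x1)
    return (inside_y[:, None] & inside_x[None, :]).reshape(-1)

def masked_cross_attention(queries, keys, values, query_mask):
    """Cross-attention whose output is kept only for masked-in query tokens.

    queries: (Nq, d) video tokens; keys: (Nk, d) and values: (Nk, dv) come
    from the product reference; query_mask: (Nq,) bool.
    Tokens outside the mask receive a zero update (assumed gating scheme).
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)                 # (Nq, Nk)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    out = weights @ values                                 # (Nq, dv)
    return out * query_mask[:, None]

# Toy usage: a 4x4 token grid with the product box in the lower-right quadrant.
rng = np.random.default_rng(0)
mask = box_to_token_mask((0.5, 0.5, 1.0, 1.0), grid_h=4, grid_w=4)
video_tokens = rng.normal(size=(16, 8))
product_tokens = rng.normal(size=(5, 8))
update = masked_cross_attention(video_tokens, product_tokens, product_tokens, mask)
```

In this sketch, only tokens inside the product box are updated from the product reference, which is one plausible way to keep product details such as logos from leaking into the human region.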
