Target-Aware Video Diffusion Models

Taeksoo Kim, Hanbyul Joo

TL;DR

This work tackles the challenge of generating videos in which an actor plausibly interacts with a specified target, given only an input image, a segmentation mask of the target, and a text prompt. The approach extends a baseline image-to-video diffusion model to accept the target mask, introduces a [TGT] token into the text prompt, and fine-tunes the model with a novel cross-attention loss that aligns the token's attention maps with the target region, supervising only the most semantically relevant transformer blocks. Fine-tuned on a curated dataset, the model aligns with the target more accurately than baselines while maintaining video quality. It supports video content creation, including long-form content generation, and zero-shot 3D HOI motion synthesis, including physics-based imitation learning pipelines, and it points toward high-level action planning in applications such as robotics.

Abstract

We present a target-aware video diffusion model that generates videos from an input image in which an actor interacts with a specified target while performing a desired action. The target is defined by a segmentation mask and the desired action is described via a text prompt. Unlike existing controllable image-to-video diffusion models that often rely on dense structural or motion cues to guide the actor's movements toward the target, our target-aware model requires only a simple mask to indicate the target, leveraging the generalization capabilities of pretrained models to produce plausible actions. This makes our method particularly effective for human-object interaction (HOI) scenarios, where providing precise action guidance is challenging, and further enables the use of video diffusion models for high-level action planning in applications such as robotics. We build our target-aware model by extending a baseline model to incorporate the target mask as an additional input. To enforce target awareness, we introduce a special token that encodes the target's spatial information within the text prompt. We then fine-tune the model with our curated dataset using a novel cross-attention loss that aligns the cross-attention maps associated with this token with the input target mask. To further improve performance, we selectively apply this loss to the most semantically relevant transformer blocks and attention regions. Experimental results show that our target-aware model outperforms existing solutions in generating videos where actors interact accurately with the specified targets. We further demonstrate its efficacy in two downstream applications: video content creation and zero-shot 3D HOI motion synthesis.
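
The cross-attention loss described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract only states that the cross-attention maps of the special token are aligned with the input target mask, so the mean-squared-error form, the single attention head, and the function names `tgt_cross_attention_map` and `cross_attention_loss` are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def tgt_cross_attention_map(queries, keys, tgt_index):
    """Attention that each visual token pays to the target token.

    queries: (N, d) visual (latent) tokens, keys: (T, d) text tokens.
    Returns an (N,) vector, one value per spatial location.
    """
    d = queries.shape[-1]
    logits = queries @ keys.T / np.sqrt(d)   # (N, T)
    attn = softmax(logits, axis=-1)          # normalize over text tokens
    return attn[:, tgt_index]

def cross_attention_loss(attn_tgt, mask):
    """Assumed loss form: MSE between the attention map and the binary mask."""
    target = mask.reshape(-1).astype(np.float64)
    return float(np.mean((attn_tgt - target) ** 2))

# Toy example: a 2x2 latent grid whose top row is the target region.
mask = np.array([[1, 1], [0, 0]])
keys = np.zeros((3, 4))
keys[1, 0] = 10.0                            # index 1 plays the role of [TGT]
queries = np.zeros((4, 4))
queries[:2, 0] = 10.0                        # target locations match the [TGT] key
queries[2:, 0] = -10.0                       # other locations do not

attn_tgt = tgt_cross_attention_map(queries, keys, tgt_index=1)
aligned = cross_attention_loss(attn_tgt, mask)
misaligned = cross_attention_loss(attn_tgt, 1 - mask)
print(aligned < misaligned)
```

In this toy setup the loss is near zero when the token's attention concentrates on the masked region and grows when the mask is inverted, which is the behavior a fine-tuning objective of this kind is meant to reward.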

Paper Structure

This paper contains 23 sections, 7 equations, 22 figures, 5 tables.

Figures (22)

  • Figure 1: Target-Aware Video Diffusion Models. Given an input image, our target-aware video diffusion model generates a video in which an actor accurately interacts with a specified target. The target is indicated to the model via a segmentation mask, while the desired action with the target is described using a text prompt.
  • Figure 2: Injecting the Extra Mask Condition. We condition the noisy video latent with a binary segmentation mask of the target to incorporate the spatial information of the target during generation.
  • Figure 3: Target-Aware Video Diffusion Models. We fine-tune the pretrained image-to-video diffusion model with an additional cross-attention loss so that the model makes use of the additional mask input.
  • Figure 4: Effect of our cross-attention loss. Our cross-attention loss effectively guides the model to focus on the target region.
  • Figure 5: Selective cross-attention loss. We apply our loss only to the transformer blocks that effectively capture semantic information and to the cross-attention areas that significantly impact the model's target awareness.
  • ...and 17 more figures
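
The captions of Figures 2 and 5 describe two mechanisms that a short sketch can make concrete: conditioning the noisy video latent on a binary target mask, and applying the loss only to selected transformer blocks. The code below is a hedged illustration, not the paper's code. Channel-wise concatenation, nearest-neighbor downsampling, repeating the mask on every frame, and averaging the loss over the selected blocks are assumptions, and the names `condition_latent_with_mask` and `selective_loss` are hypothetical.

```python
import numpy as np

def resize_mask_nearest(mask, out_h, out_w):
    """Nearest-neighbor downsampling of a 2D mask to (out_h, out_w)."""
    in_h, in_w = mask.shape
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return mask[np.ix_(rows, cols)]

def condition_latent_with_mask(latent, mask):
    """Append the target mask to the noisy latent as one extra channel.

    latent: (F, C, H, W) noisy video latent, mask: (H0, W0) binary mask.
    Returns an array of shape (F, C + 1, H, W).
    Assumption: the same mask is repeated on every frame.
    """
    f, _, h, w = latent.shape
    small = resize_mask_nearest(mask, h, w).astype(latent.dtype)
    mask_channel = np.broadcast_to(small, (f, 1, h, w))
    return np.concatenate([latent, mask_channel], axis=1)

def selective_loss(per_block_losses, selected_blocks):
    """Average the loss over a chosen subset of transformer blocks only."""
    return float(np.mean([per_block_losses[i] for i in selected_blocks]))

# Toy example: 4 latent frames, 16 channels, 8x8 resolution, 64x64 mask.
rng = np.random.default_rng(0)
latent = rng.normal(size=(4, 16, 8, 8))
mask = np.zeros((64, 64))
mask[16:32, 16:32] = 1.0

conditioned = condition_latent_with_mask(latent, mask)
print(conditioned.shape)

# Hypothetical per-block loss values; only blocks 1 and 2 are supervised.
block_loss = selective_loss([0.5, 0.1, 0.3, 0.9], selected_blocks=[1, 2])
```

Which blocks count as semantically relevant is determined in the paper itself; the indices used here are placeholders.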