InTraGen: Trajectory-controlled Video Generation for Object Interactions

Zuhao Liu, Aleksandar Yanev, Ahmad Mahmood, Ivan Nikolov, Saman Motamed, Wei-Shi Zheng, Xi Wang, Luc Van Gool, Danda Pani Paudel

TL;DR

This work introduces InTraGen, a pipeline for improved trajectory-based generation of object interaction scenarios, together with a multi-modal interaction encoding pipeline whose object ID injection mechanism enriches object-environment interactions.

Abstract

Advances in video generation have significantly improved the realism and quality of created scenes. This has fueled interest in developing intuitive tools that let users leverage video generation as world simulators. Text-to-video (T2V) generation is one such approach, enabling video creation from text descriptions only. Yet, due to the inherent ambiguity in texts and the limited temporal information offered by text prompts, researchers have explored additional control signals like trajectory-guided systems, for more accurate T2V generation. Nonetheless, methods to evaluate whether T2V models can generate realistic interactions between multiple objects are lacking. We introduce InTraGen, a pipeline for improved trajectory-based generation of object interaction scenarios. We propose 4 new datasets and a novel trajectory quality metric to evaluate the performance of the proposed InTraGen. To achieve object interaction, we introduce a multi-modal interaction encoding pipeline with an object ID injection mechanism that enriches object-environment interactions. Our results demonstrate improvements in both visual fidelity and quantitative performance. Code and datasets are available at https://github.com/insait-institute/InTraGen

Paper Structure

This paper contains 32 sections, 3 equations, 15 figures, 4 tables, 1 algorithm.

Figures (15)

  • Figure 1: Our pipeline generates realistic object interactions—such as crossing, collision, and falling—based on input trajectories. The figure displays four different scenarios. For each scenario, a desired trajectory is provided (left), and the corresponding keyframes (right) illustrate interactions that follow the specified trajectory conditions.
  • Figure 2: Overview of the Model Architecture: The figure illustrates the key components of InTraGen: the Trajectory Control Model and the DiT block. Object IDs and sparse poses are first encoded using a VAE to generate latent representations. The resulting latents are then integrated through the Multi-Modal Interaction Encoding pipeline (Figure 3) to encode the rich object interaction information.
  • Figure 3: The process of multi-modal interaction encoding. The user-provided input trajectory is first transformed into sparse poses (representing the dynamic information) and object ID maps (containing static and interactive information). Two independent encoders then encode these two inputs, and a fusion network combines the encodings and outputs the interaction condition for the DiT.
  • Figure 4: Qualitative evaluation of our model on the Panda-70M and SoccerNet datasets.
  • Figure 5: Comparison with different advanced and commercial models on an example from our dominos dataset. All other methods, including I2VGen-XL [zhang2023i2vgen], LUMA [luma], and SEINE [chen2023seine], fail in this scene. Our method shows the most realistic results in this interactive scenario. For SEINE and I2VGen-XL, the first frame from the ground truth is additionally passed.
  • ...and 10 more figures
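
The Figure 3 caption describes turning a user trajectory into two conditioning inputs: sparse poses and object ID maps. The sketch below illustrates one plausible way to rasterize them with NumPy. It is not the authors' implementation: the function name `rasterize_trajectories`, the single-pixel pose encoding, the circular object footprint, and the `radii` parameter are all assumptions made for illustration.

```python
import numpy as np


def rasterize_trajectories(trajectories, radii, num_frames, height, width):
    """Hypothetical helper: build per-frame sparse pose maps and object ID maps.

    trajectories: dict mapping an integer object ID (> 0) to an array of
                  shape (num_frames, 2) holding (x, y) centers per frame.
    radii:        dict mapping object ID to an assumed circular footprint radius.
    """
    # Sparse poses: one active pixel per object per frame (assumed encoding).
    sparse_poses = np.zeros((num_frames, height, width), dtype=np.float32)
    # ID maps: each pixel holds the ID of the object covering it, 0 = background.
    id_maps = np.zeros((num_frames, height, width), dtype=np.int32)

    ys, xs = np.mgrid[0:height, 0:width]
    for obj_id, path in trajectories.items():
        for t in range(num_frames):
            x, y = path[t]
            xi, yi = int(round(x)), int(round(y))
            if 0 <= xi < width and 0 <= yi < height:
                sparse_poses[t, yi, xi] = 1.0
            # Fill a disk around the center with the object's ID.
            mask = (xs - x) ** 2 + (ys - y) ** 2 <= radii[obj_id] ** 2
            id_maps[t][mask] = obj_id
    return sparse_poses, id_maps


# Two objects moving horizontally in opposite directions over 4 frames.
num_frames, height, width = 4, 16, 16
trajectories = {
    1: np.stack([np.linspace(2, 12, num_frames), np.full(num_frames, 4.0)], axis=1),
    2: np.stack([np.linspace(12, 2, num_frames), np.full(num_frames, 10.0)], axis=1),
}
radii = {1: 2, 2: 2}
sparse_poses, id_maps = rasterize_trajectories(
    trajectories, radii, num_frames, height, width
)
```

In the pipeline described by the captions, arrays like these would then be passed to the two encoders and the fusion network; those components are not sketched here because the text above does not specify them.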