Mind the Time: Temporally-Controlled Multi-Event Video Generation

Ziyi Wu, Aliaksandr Siarohin, Willi Menapace, Ivan Skorokhodov, Yuwei Fang, Varnith Chordia, Igor Gilitschenski, Sergey Tulyakov

TL;DR

This work tackles the problem of generating videos containing multiple events with precise temporal control. It introduces MinT, a temporally-grounded video generator built on a pre-trained latent DiT backbone, augmented with a temporally aware cross-attention mechanism using ReRoPE to bind each event to a specific time interval. The model supports scene-cut conditioning and a prompt enhancer that leverages LLMs to convert short prompts into rich global and temporal captions, enabling richer motion and smoother transitions. Experiments on HoldOut and StoryBench show state-of-the-art performance in event alignment and transition quality, with strong visual fidelity, demonstrating practical potential for controllable multi-event video generation, while outlining limitations and directions for future work.

Abstract

Real-world videos consist of sequences of events. Generating such sequences with precise temporal control is infeasible with existing video generators that rely on a single paragraph of text as input. When tasked with generating multiple events described using a single prompt, such methods often ignore some of the events or fail to arrange them in the correct order. To address this limitation, we present MinT, a multi-event video generator with temporal control. Our key insight is to bind each event to a specific period in the generated video, which allows the model to focus on one event at a time. To enable time-aware interactions between event captions and video tokens, we design a time-based positional encoding method, dubbed ReRoPE. This encoding helps to guide the cross-attention operation. By fine-tuning a pre-trained video diffusion transformer on temporally grounded data, our approach produces coherent videos with smoothly connected events. For the first time in the literature, our model offers control over the timing of events in generated videos. Extensive experiments demonstrate that MinT outperforms existing commercial and open-source models by a large margin.

Paper Structure

This paper contains 34 sections, 18 equations, 19 figures, 4 tables.

Figures (19)

  • Figure 1: Time-controlled multi-event video generation with MinT. Given a sequence of event text prompts and their desired start and end timestamps, MinT synthesizes smoothly connected events with consistent subjects and backgrounds. In addition, it can control the time span of each event flexibly. Here, we show the results of sequential gestures, daily activities, facial expressions, and cat movements.
  • Figure 2: Multi-event video generation results from SOTA video generators and MinT. We run two open-source models, CogVideoX-5B and Mochi 1, and two commercial models, Kling 1.5 and Gen-3 Alpha, to generate sequential events. All of them generate only a subset of the events and ignore the rest. In contrast, MinT generates a natural video with all events smoothly connected. Please refer to the appendix and our project page (https://mint-video.github.io/#compare-with-sota) for more comparisons. Comparisons with Sora can be found at https://mint-video.github.io/#compare-with-sora.
  • Figure 3: MinT framework. (a) Our model takes in a global caption describing the overall video, and a list of temporal captions specifying the sequential events. We bind each event to a time range, enabling temporal control of the generated events. (b) To condition the video DiT on temporal captions, we introduce a new temporal cross-attention layer in each DiT block, which (c) concatenates the text embedding of all event prompts and leverages a time-aware positional encoding (Pos.Enc.) method to associate each event to its corresponding frames based on the event timestamps. MinT supports an additional scene cut conditioning, which can control the shot transition of the video.
  • Figure 4: Comparison of vanilla RoPE and our Rescaled RoPE. We use the same random vector for video tokens and text embeddings to only visualize the bias introduced by positional encoding. (a) Vanilla RoPE uses raw timestamps as the rotation angle, where frames within one event might be biased to the wrong text. (b) We instead rescale all events to have the same length $L$, so that video tokens always attend the most to the current event. In addition, frames at event boundaries attend to adjacent events equally.
  • Figure 5: T2V results on HoldOut and StoryBench. For CogVideoX and Mochi we concatenated the events into a single prompt, similar to the Concat baseline. Metrics in the first row measure visual quality, while those in the second row focus on the text alignment and transition smoothness between events. MinT performs the best in event-related metrics while maintaining a high visual quality.
  • ...and 14 more figures
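
The rescaling idea in the Figure 4 caption can be illustrated with a small sketch. This is not the authors' implementation: it only assumes that each event interval is linearly stretched to a common length L, and that each event caption is anchored at the midpoint of its interval (both in the raw and in the rescaled case). The function names `rescaled_position`, `caption_position`, and `nearest_event`, the value `L = 8.0`, and the example timestamps are all hypothetical. "Nearest anchor" is used here as a rough stand-in for the positional bias that RoPE introduces in cross-attention.

```python
def rescaled_position(t, events, L=8.0):
    """Map a raw timestamp t to a position where every event spans length L.

    `events` is a list of contiguous (start, end) intervals (an assumption of
    this sketch). A frame in the n-th event lands in [n * L, (n + 1) * L].
    """
    for n, (start, end) in enumerate(events):
        if start <= t <= end:
            return n * L + (t - start) / (end - start) * L
    raise ValueError(f"timestamp {t} is outside all events")


def caption_position(n, L=8.0):
    """Assumed anchor for the n-th caption: the midpoint of its rescaled span."""
    return (n + 0.5) * L


def nearest_event(pos, anchors):
    """Index of the anchor closest to pos (a proxy for the strongest bias)."""
    return min(range(len(anchors)), key=lambda n: abs(pos - anchors[n]))


if __name__ == "__main__":
    # Three events of unequal duration (seconds); the middle one is long.
    events = [(0.0, 1.0), (1.0, 4.0), (4.0, 5.0)]
    t = 3.9  # a frame that still belongs to event 1

    # Raw timestamps: captions anchored at raw interval midpoints.
    raw_anchors = [(s + e) / 2 for s, e in events]
    print("raw timestamps favor event", nearest_event(t, raw_anchors))

    # Rescaled timestamps: every event has length L.
    anchors = [caption_position(n) for n in range(len(events))]
    pos = rescaled_position(t, events)
    print("rescaled timestamps favor event", nearest_event(pos, anchors))
```

Under these assumptions, a frame late in a long event can sit closer to the next event's caption when raw timestamps are used, while after rescaling every frame is within L/2 of its own event's anchor, and a frame exactly on a boundary is equidistant from the two adjacent anchors.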