Training Video Foundation Models with NVIDIA NeMo

Zeeshan Patel, Ethan He, Parth Mannan, Xiaowei Ren, Ryan Wolf, Niket Agarwal, Jacob Huffman, Zhuoyao Wang, Carl Wang, Jack Chang, Yan Bai, Tommy Huang, Linnan Wang, Sahil Jain, Shanmugam Ramasamy, Joseph Jennings, Ekaterina Sirazitdinova, Oleg Sudakov, Mingyuan Ma, Bobby Chen, Forrest Lin, Hao Wang, Vasanth Rao Naik Sabavat, Sriharsha Niverty, Rong Ou, Pallab Bhattacharya, David Page, Nima Tajbakhsh, Ashwath Aithal

TL;DR

Training video foundation models (VFMs) at scale requires handling massive multimodal video data and long-range temporal modeling. The paper presents an open-source, end-to-end VFM framework built on NeMo that integrates NeMo Curator for data curation, Megatron Energon for multimodal data loading, a diffusion Transformer training stack with 4D parallelism, and an efficient context-parallel inference engine. Key contributions include AdaLN-LoRA, ST-DiT, a customizable video tokenizer, and extensive algorithm-system co-design, with benchmarks against Fast-DiT showing higher Model FLOPs Utilization (MFU) and near-linear scaling. The framework gives researchers and engineers practical tooling and guidelines for building, fine-tuning, and serving large VFMs across robotics, autonomous systems, and entertainment.

Abstract

Video Foundation Models (VFMs) have recently been used to simulate the real world to train physical AI systems and develop creative visual experiences. However, there are significant challenges in training large-scale VFMs that can generate high-quality videos. We present a scalable, open-source VFM training pipeline with NVIDIA NeMo, providing accelerated video dataset curation, multimodal data loading, and parallelized video diffusion model training and inference. We also provide a comprehensive performance analysis highlighting best practices for efficient VFM training and inference.

Paper Structure

This paper contains 27 sections, 1 equation, 13 figures, 2 tables.

Figures (13)

  • Figure 1: VFM Training Stack. NeMo provides an end-to-end stack for training video foundation models, leveraging NeMo Curator for video curation, Megatron Core for scaling transformer models, and the NeMo Framework for pre-training, fine-tuning, and accelerated inference.
  • Figure 2: Video Curation Pipeline. The video curation pipeline clips and processes large amounts of raw video. The clips are then sharded and stored in the cloud in the WebDataset format.
  • Figure 3: Auto-Balanced Curation Pipeline. Certain curation stages can limit the throughput of the entire curation pipeline. We created an auto-balancing system that matches throughput across stages by allocating the optimal number of workers to each curation stage.
  • Figure 4: Mixed Resolution Image-Video Training. We utilize sequence packing with padding to enable joint training on images and videos with different resolutions and video lengths.
  • Figure 5: Video Diffusion Transformer. Our pipeline takes input signals such as text, videos, and the noise timestep, which are compressed and used to train a video diffusion transformer.
  • ...and 8 more figures
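
The auto-balancing idea in the Figure 3 caption can be illustrated with a small sketch. This is not the paper's implementation, which is not described here. It is a simple greedy heuristic under stated assumptions: each stage has a known per-worker throughput in clips per second, throughput scales linearly with worker count, and the pipeline runs at the speed of its slowest stage. The stage names, rates, and the functions `allocate_workers` and `pipeline_throughput` are hypothetical.

```python
def allocate_workers(per_worker_rate, total_workers):
    """Greedily assign workers so that stage throughputs are as even as possible.

    per_worker_rate: dict mapping a (hypothetical) stage name to the clips/sec
    that one worker of that stage can process.
    """
    if total_workers < len(per_worker_rate):
        raise ValueError("need at least one worker per stage")
    # Every stage needs at least one worker to run at all.
    alloc = {stage: 1 for stage in per_worker_rate}
    for _ in range(total_workers - len(alloc)):
        # The bottleneck is the stage with the lowest aggregate throughput;
        # give the next spare worker to it.
        bottleneck = min(alloc, key=lambda s: alloc[s] * per_worker_rate[s])
        alloc[bottleneck] += 1
    return alloc


def pipeline_throughput(alloc, per_worker_rate):
    """End-to-end throughput is limited by the slowest stage."""
    return min(alloc[s] * per_worker_rate[s] for s in alloc)


# Hypothetical rates: captioning is 8x slower per worker than splitting.
rates = {"split": 8.0, "transcode": 4.0, "caption": 1.0}
alloc = allocate_workers(rates, total_workers=11)
print(alloc, pipeline_throughput(alloc, rates))
```

With these illustrative rates, most workers end up on the slowest stage, which is the behavior the caption describes: slow stages receive more workers until no single stage limits the pipeline.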