Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model

Haoyang Huang, Guoqing Ma, Nan Duan, Xing Chen, Changyi Wan, Ranchen Ming, Tianyu Wang, Bo Wang, Zhiying Lu, Aojie Li, Xianfang Zeng, Xinhao Zhang, Gang Yu, Yuhe Yin, Qiling Wu, Wen Sun, Kang An, Xin Han, Deshan Sun, Wei Ji, Bizhu Huang, Brian Li, Chenfei Wu, Guanzhe Huang, Huixin Xiong, Jiaxin He, Jianchang Wu, Jianlong Yuan, Jie Wu, Jiashuai Liu, Junjing Guo, Kaijun Tan, Liangyu Chen, Qiaohui Chen, Ran Sun, Shanshan Yuan, Shengming Yin, Sitong Liu, Wei Chen, Yaqi Dai, Yuchu Luo, Zheng Ge, Zhisheng Guan, Xiaoniu Song, Yu Zhou, Binxing Jiao, Jiansheng Chen, Jing Li, Shuchang Zhou, Xiangyu Zhang, Yi Xiu, Yibo Zhu, Heung-Yeung Shum, Daxin Jiang

TL;DR

This work introduces Step-Video-TI2V, a 30B-parameter open-source text-driven image-to-video (TI2V) model built by extending Step-Video-T2V with Image Conditioning and Motion Conditioning, which let it generate a video from a given image and control the amount of motion. Latent fusion based on the Video-VAE and an AdaLN-Single conditioning pathway provide explicit motion control. A new benchmark, Step-Video-TI2V-Eval, covers real-world and anime-style TI2V evaluation across instruction adherence, subject-background consistency, and physical realism. Empirical results show state-of-the-art performance on Step-Video-TI2V-Eval and VBench-I2V against both open-source and commercial engines, with caveats around instruction adherence that are attributed to the training data distribution. The work also highlights how much anime-style data matters for performance and offers the benchmark as a guide for future TI2V research and development.

Abstract

We present Step-Video-TI2V, a state-of-the-art text-driven image-to-video generation model with 30B parameters, capable of generating videos of up to 102 frames from both text and image inputs. We build Step-Video-TI2V-Eval as a new benchmark for the text-driven image-to-video task and use it to compare Step-Video-TI2V with open-source and commercial TI2V engines. Experimental results demonstrate the state-of-the-art performance of Step-Video-TI2V on the image-to-video generation task. Both Step-Video-TI2V and Step-Video-TI2V-Eval are available at https://github.com/stepfun-ai/Step-Video-TI2V.

Paper Structure

This paper contains 14 sections, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Overview of Step-Video-TI2V. Based on the pre-trained T2V model, we introduce two key modifications: Image Conditioning and Motion Conditioning. These enhancements enable video generation from a given image while allowing users to adjust the dynamic level of the output video.
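
The two modifications named in the Figure 1 caption can be illustrated with a small NumPy sketch. It is not the released implementation. It assumes that Image Conditioning places the Video-VAE latent of the input image in the first frame of an otherwise zero tensor and concatenates it with the noisy video latents along the channel axis, and that Motion Conditioning embeds a scalar motion score, adds it to the timestep embedding, and feeds the sum to an AdaLN-style scale and shift. All function names, tensor shapes, and the sinusoidal embedding are hypothetical choices made for illustration.

```python
import numpy as np

def fuse_image_latent(noisy_latents, image_latent):
    """Assumed image conditioning: channel-wise concat with a zero-padded image latent.

    noisy_latents: (frames, channels, height, width)
    image_latent:  (channels, height, width), e.g. a VAE encoding of the input image
    """
    cond = np.zeros_like(noisy_latents)
    cond[0] = image_latent  # only the first frame carries the image
    return np.concatenate([noisy_latents, cond], axis=1)  # (F, 2C, H, W)

def sinusoidal_embedding(value, dim):
    """Standard sinusoidal embedding of a scalar (dim is assumed even)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = value * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def conditioning_vector(timestep, motion_score, dim=8):
    """Assumed motion conditioning: motion embedding added to the timestep embedding."""
    return sinusoidal_embedding(timestep, dim) + sinusoidal_embedding(motion_score, dim)

def adaln_modulate(x, cond, w_scale, w_shift):
    """Simplified adaptive layer norm: normalize tokens, then scale and shift from cond.

    x: (tokens, hidden); cond: (dim,); w_scale, w_shift: (dim, hidden)
    """
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    normed = (x - mean) / (std + 1e-6)
    scale = cond @ w_scale
    shift = cond @ w_shift
    return normed * (1.0 + scale) + shift
```

In this sketch a larger `motion_score` changes only the conditioning vector, so the same image latent can be paired with different requested dynamic levels without altering the fused input.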