Transform Trained Transformer: Accelerating Native 4K Video Generation Over 10$\times$
Authors
Jiangning Zhang, Junwei Zhu, Teng Hu, Yabiao Wang, Donghao Luo, Weijian Cao, Zhenye Gan, Xiaobin Hu, Zhucun Xue, Chengjie Wang
Abstract
Native 4K ($2160\times3840$) video generation remains a critical challenge due to the quadratic computational explosion of full attention as spatiotemporal resolution increases, making it difficult for models to strike a balance between efficiency and quality. This paper proposes a novel Transformer retrofit strategy termed $T^3$ (Transform Trained Transformer) that, without altering the core architecture of full-attention pretrained models, significantly reduces compute requirements by optimizing their forward logic. Specifically, $T^3$ introduces a multi-scale weight-sharing window attention mechanism and, via hierarchical blocking together with an axis-preserving full-attention design, can effect an "attention pattern" transformation of a pretrained model using only modest compute and data. Results on 4K-VBench show that $T^3$ substantially outperforms existing approaches: while delivering performance improvements (+4.29 VQA and +0.08 VTC), it accelerates native 4K video generation by more than 10$\times$. Project page at https://zhangzjn.github.io/projects/T3-Video
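The abstract's core mechanism, window attention that reuses a pretrained model's full-attention weights across several window scales, can be illustrated in code. Below is a minimal PyTorch sketch under stated assumptions: the module name, the `window_sizes` parameter, and the token-partitioning and averaging scheme are hypothetical stand-ins for a multi-scale weight-sharing window attention, not the authors' $T^3$ implementation.

```python
import torch
import torch.nn.functional as F
from torch import nn

class MultiScaleWindowAttention(nn.Module):
    """Sketch: one shared QKV/output projection (as in a pretrained
    full-attention block) applied under several window sizes.
    Hypothetical illustration, not the T^3 code."""

    def __init__(self, dim: int, num_heads: int, window_sizes=(4, 8, 16)):
        super().__init__()
        self.num_heads = num_heads
        self.window_sizes = window_sizes
        # Weight sharing: a single QKV and output projection serve every
        # scale, so pretrained full-attention weights load unchanged.
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def _windowed_attn(self, x: torch.Tensor, w: int) -> torch.Tensor:
        b, n, d = x.shape  # tokens assumed flattened; n divisible by w
        qkv = self.qkv(x).reshape(b, n // w, w, 3, self.num_heads, -1)
        # Each of q, k, v: (batch, windows, heads, window_len, head_dim)
        q, k, v = qkv.permute(3, 0, 1, 4, 2, 5).unbind(0)
        out = F.scaled_dot_product_attention(q, k, v)  # attends within windows
        return out.permute(0, 1, 3, 2, 4).reshape(b, n, d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Average window-attention outputs over scales: cost is O(n * w)
        # per scale instead of the O(n^2) of full attention.
        y = sum(self._windowed_attn(x, w) for w in self.window_sizes)
        return self.proj(y / len(self.window_sizes))
```

Because every scale flows through the same projections, only the forward logic (which tokens attend to which) changes, which is consistent with the paper's claim of retrofitting a pretrained model without altering its core architecture. For example, `MultiScaleWindowAttention(dim=64, num_heads=4)(torch.randn(2, 64, 64))` runs windows of 4, 8, and 16 tokens over a 64-token sequence.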