Kling-Omni Technical Report

Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, Xiao Hu, Xiaohua Hu, Boyuan Jiang, Fangyuan Kong, Hang Li, Jie Li, Qingyu Li, Shen Li, Xiaohan Li, Yan Li, Jiajun Liang, Borui Liao, Yiqiao Liao, Weihong Lin, Quande Liu, Xiaokun Liu, Yilun Liu, Yuliang Liu, Shun Lu, Hangyu Mao, Yunyao Mao, Haodong Ouyang, Wenyu Qin, Wanqi Shi, Xiaoyu Shi, Lianghao Su, Haozhi Sun, Peiqin Sun, Pengfei Wan, Chao Wang, Chenyu Wang, Meng Wang, Qiulin Wang, Runqi Wang, Xintao Wang, Xuebo Wang, Zekun Wang, Min Wei, Tiancheng Wen, Guohao Wu, Xiaoshi Wu, Zhenhua Wu, Da Xie, Yingtong Xiong, Yulong Xu, Sile Yang, Zikang Yang, Weicai Ye, Ziyang Yuan, Shenglong Zhang, Shuaiyu Zhang, Yuanxing Zhang, Yufan Zhang, Wenzheng Zhao, Ruiliang Zhou, Yan Zhou, Guosheng Zhu, Yongjie Zhu

TL;DR

Kling-Omni tackles fragmentation in multimodal video creation by unifying generation, editing, and reasoning under a single multimodal visual language (MVL) framework. It integrates a Prompt Enhancer, an Omni-Generator, and Multimodal Super-Resolution within a diffusion-transformer pipeline, trained through pre-training, supervised fine-tuning, reinforcement learning with Direct Preference Optimization, and two-stage distillation for efficiency. A dedicated data system combining real-world and synthetic data, together with rigorous processing and alignment, underpins stable temporal coherence and cross-modal fidelity. Empirical evaluation on the OmniVideo-1.0 benchmark shows state-of-the-art performance on reference-based generation and editing, along with advanced capabilities such as temporal narration and visual-signal controllable generation. Collectively, Kling-Omni represents a major step toward multimodal world simulators that perceive, reason, generate, and interact in dynamic environments.

Abstract

We present Kling-Omni, a generalist generative framework designed to synthesize high-fidelity videos directly from multimodal visual language inputs. Adopting an end-to-end perspective, Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks, integrating them into a holistic system. Unlike disjointed pipeline approaches, Kling-Omni supports a diverse range of user inputs, including text instructions, reference images, and video contexts, processing them into a unified multimodal representation to deliver cinematic-quality, highly intelligent video content creation. To support these capabilities, we constructed a comprehensive data system that serves as the foundation for multimodal video creation. The framework is further empowered by efficient large-scale pre-training strategies and infrastructure optimizations for inference. Comprehensive evaluations reveal that Kling-Omni demonstrates exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following. Moving beyond a content creation tool, we believe Kling-Omni is a pivotal advancement toward multimodal world simulators capable of perceiving, reasoning about, generating, and interacting with dynamic and complex worlds.

Paper Structure

This paper contains 33 sections, 28 figures, 1 table.

Figures (28)

  • Figure 1: Overview of Kling-Omni, a generalist framework that introduces multimodal visual language as the interaction mechanism, supporting diverse tasks including video generation, editing, and intelligent reasoning.
  • Figure 2: Attention maps in Multimodal Super-Resolution. The left panel illustrates the map for even-numbered layers, while the right panel shows the map for odd-numbered layers. Skipping the computation for the shaded regions leads to a substantial reduction in computational load and supports accelerated inference with a KV cache.
  • Figure 3: Online training data pipeline. Raw data is distributed across DP/PP groups using an inference scheduler. After inference, a training scheduler reorders data for balanced workload.
  • Figure 4: The pipeline schedule in Kling-Omni. The inference passes of the VAE/TE are distributed across both data and pipeline parallelism, following an interleaved 1F1B pipeline schedule. Pipeline-aware offloading and onloading are introduced to reduce GPU memory consumption without blocking the forward or backward pass, and an online load-balancing scheduler runs on the CPU to determine the Ulysses parallel size and the workload for each microbatch.
  • Figure 5: Cross-modal and cross-task data distribution in our constructed data system.
  • ...and 23 more figures
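
The caption of Figure 3 mentions a training scheduler that reorders data for a balanced workload, but the summary above does not say how the reordering is done. The following is only a minimal sketch of one common heuristic (greedy longest-processing-time assignment), not the method used in Kling-Omni. The function name `balance_workload`, the per-sample cost (for example a token count), and the number of groups are all hypothetical.

```python
import heapq


def balance_workload(costs, num_groups):
    """Greedy LPT heuristic: assign sample indices to groups so that
    the summed cost per group is roughly equal.

    Assumption (not from the report): each sample has a scalar cost,
    e.g. its token count, and groups stand in for data-parallel ranks.
    """
    # Min-heap of (current_load, group_id); the lightest group is popped first.
    heap = [(0, g) for g in range(num_groups)]
    heapq.heapify(heap)
    groups = [[] for _ in range(num_groups)]

    # Place the most expensive samples first, each into the lightest group.
    order = sorted(range(len(costs)), key=lambda i: costs[i], reverse=True)
    for idx in order:
        load, g = heapq.heappop(heap)
        groups[g].append(idx)
        heapq.heappush(heap, (load + costs[idx], g))
    return groups


if __name__ == "__main__":
    sample_costs = [8, 7, 6, 5, 4]  # hypothetical per-sample costs
    assignment = balance_workload(sample_costs, num_groups=2)
    loads = [sum(sample_costs[i] for i in g) for g in assignment]
    print(assignment, loads)
```

The heuristic is not optimal, but with this greedy rule the gap between the heaviest and lightest group never exceeds the cost of the single most expensive sample, which is often good enough for batching variable-length video and text samples.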