Neodragon: Mobile Video Generation using Diffusion Transformer

Animesh Karnewar, Denis Korzhenkov, Ioannis Lelekas, Adil Karjauv, Noor Fathima, Hanwen Xiong, Vancheeswaran Vaidyanathan, Will Zeng, Rafael Esteves, Tushar Singhal, Fatih Porikli, Mohsen Ghafoorian, Amirhossein Habibian

TL;DR

Neodragon demonstrates that high-fidelity text-to-video generation can run entirely on mobile hardware when the whole diffusion-transformer pipeline is optimised jointly. The approach combines four techniques: Text-Encoder Distillation, which compresses T5-XXL into DT5 with a ContextAdapter; Asymmetric Decoder Distillation, which swaps in a mobile-friendly decoder; MMDiT Block Pruning, which shrinks the denoiser backbone; and Step Distillation, which reduces the NFEs required by the denoiser. The result is a 2s video (49 frames at 24 fps) at 640x1024 with 6.7s end-to-end latency on a Qualcomm Hexagon NPU. The system achieves a VBench total score of 81.61 and uses SSD1B for first-frame generation plus QuickSRNet for 2x super-resolution, enabling private, on-device generation without reliance on cloud services. The work also offers a practical blueprint for modular, mobile-friendly diffusion-based video systems, with potential for real-time creative applications.

Abstract

We introduce Neodragon, a text-to-video system capable of generating 2s (49 frames @24 fps) videos at the 640x1024 resolution directly on a Qualcomm Hexagon NPU in a record 6.7s (7 FPS). Differing from existing transformer-based offline text-to-video generation models, Neodragon is the first to have been specifically optimised for mobile hardware to achieve efficient and high-fidelity video synthesis. We achieve this through four key technical contributions: (1) Replacing the original large 4.762B T5xxl Text-Encoder with a much smaller 0.2B DT5 (DistilT5) with minimal quality loss, enabled through a novel Text-Encoder Distillation procedure. (2) Proposing an Asymmetric Decoder Distillation approach allowing us to replace the native codec-latent-VAE decoder with a more efficient one, without disturbing the generative latent-space of the generation pipeline. (3) Pruning of MMDiT blocks within the denoiser backbone based on their relative importance, with recovery of original performance through a two-stage distillation process. (4) Reducing the NFE (Neural Functional Evaluation) requirement of the denoiser by performing step distillation using DMD adapted for pyramidal flow-matching, thereby substantially accelerating video generation. When paired with an optimised SSD1B first-frame image generator and QuickSRNet for 2x super-resolution, our end-to-end Neodragon system becomes a mobile-friendly model that is efficient in parameters (4.945B full model), memory (3.5GB peak RAM usage), and runtime (6.7s E2E latency), while achieving a VBench total score of 81.61. By enabling low-cost, private, and on-device text-to-video synthesis, Neodragon democratizes AI-based video content creation, empowering creators to generate high-quality videos without reliance on cloud services. Code and model will be made publicly available at our website: https://qualcomm-ai-research.github.io/neodragon
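
As a quick sanity check on the headline numbers in the abstract, the short sketch below recomputes the clip duration, the generation throughput behind the quoted "7 FPS", and the text-encoder compression ratio. The function name `clip_stats` and its dictionary keys are illustrative choices, not something defined by the paper; only the input values (49 frames, 24 fps, 6.7s, 4.762B, 0.2B) come from the abstract.

```python
def clip_stats(frames: int, playback_fps: float, latency_s: float) -> dict:
    """Derive simple throughput figures from frame count, playback rate and latency.

    Hypothetical helper for illustration only.
    """
    clip_duration_s = frames / playback_fps
    return {
        # Length of the generated clip when played back.
        "clip_duration_s": clip_duration_s,
        # Frames produced per second of wall-clock generation time.
        "generation_fps": frames / latency_s,
        # Seconds of compute spent per second of generated video.
        "latency_per_video_second": latency_s / clip_duration_s,
    }


# Values quoted in the abstract.
stats = clip_stats(frames=49, playback_fps=24, latency_s=6.7)

# Ratio of the original T5xxl text-encoder size to the distilled DT5 size.
encoder_compression = 4.762 / 0.2

print(f"clip duration:      {stats['clip_duration_s']:.2f} s")
print(f"generation rate:    {stats['generation_fps']:.2f} frames/s")
print(f"compute per second: {stats['latency_per_video_second']:.2f} s")
print(f"encoder shrinkage:  {encoder_compression:.2f}x")
```

Under these inputs the clip lasts roughly 2.04s and is generated at about 7.3 frames per second, so generation takes a little over three times the playback duration.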

Paper Structure

This paper contains 38 sections, 34 equations, 15 figures, 9 tables.

Figures (15)

  • Figure 1: An overview of the optimisation steps of Neodragon, our proposed efficient text-to-video generation system designed to run directly on mobile devices powered by the Qualcomm Hexagon NPU.
  • Figure 2: Overview of the Pyramidal Autoregressive Video Diffusion Pipeline. The pyramidal autoregressive video diffusion scheme (Jin et al., 2024) differs from conventional latent diffusion in how the latent video frames are generated (iteratively denoised). The latent frames are generated autoregressively, one by one, by denoising the current frame while conditioning on the past history. A spatio-temporal pyramid is applied in the denoising process in two ways: first, the denoising of the current frame starts from a lower resolution and proceeds to the highest native latent resolution; second, each denoising step is conditioned on the past history, where frames from the further past are spatially downsampled.
  • Figure 3: Overview of the proposed Text-Encoder Distillation framework. The original large-scale text-encoder $\mathit{T5}_\text{XXL}$ is distilled into a light-weight model via a trainable $\mathit{CA}$ (ContextAdapter) module, using a combination of MSE and Cosine Distance loss to align the embeddings. Multiple modes are supported in our framework. Replace Mode [RM]: the new $\mathit{CA}$ replaces the original $\mathit{CE}$ (ContextEmbedder); Extend Mode [EM]: the new $\mathit{CA}$ extends the original $\mathit{CE}$; LoRA Mode [LORA]: the $\mathit{CA}$ is not a separate MLP but LoRA (Hu et al., 2022) layers on top of the $\mathit{DT5}$ text-encoder. In addition, the [TDT5] (Trainable-$\mathit{DT5}$) mode allows training the smaller text-encoder instead of keeping it frozen.
  • Figure 4: Qualitative Evaluation of Text-Encoder Distillation. We visualise randomly selected frames from the generated [49×320×512] videos corresponding to the adjacent text prompts, across the four modes supported by our Text-Encoder Distillation framework: [RM], [EM], [LORA], and [TDT5].
  • Figure 5: Ablations for Text-Encoder Distillation. We ablate the loss weights $w_\text{mse}$ and $w_\text{cd}$ for the [RM] mode in (a); and ablate the two controllable hyperparameters of the LoRA layers, namely dimensions (dims) and the scale (alpha) of [LORA] mode in (b).
  • ...and 10 more figures
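
The captions of Figures 3 and 5 describe aligning student and teacher text embeddings with a combination of MSE and Cosine Distance, weighted by $w_\text{mse}$ and $w_\text{cd}$. A minimal sketch of such a combined objective is given below in plain Python. It is an illustration under stated assumptions rather than the authors' implementation: the function names, the per-token averaging, the default weights, and the exact tensors being compared are all assumptions, since the captions do not specify them.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cosine_distance(a: Vector, b: Vector, eps: float = 1e-8) -> float:
    """One minus cosine similarity; eps guards against zero-norm vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / max(norm, eps)


def distillation_loss(
    student: Sequence[Vector],
    teacher: Sequence[Vector],
    w_mse: float = 1.0,  # assumed default; the paper ablates this weight
    w_cd: float = 1.0,   # assumed default; the paper ablates this weight
) -> float:
    """Weighted MSE + cosine-distance loss, averaged over tokens (assumed reduction)."""
    if len(student) != len(teacher):
        raise ValueError("student and teacher must have the same number of tokens")
    per_token = [
        w_mse * mse(s, t) + w_cd * cosine_distance(s, t)
        for s, t in zip(student, teacher)
    ]
    return sum(per_token) / len(per_token)


# Toy example: two tokens with 2-dimensional embeddings.
teacher_emb = [[1.0, 0.0], [0.0, 1.0]]
student_emb = [[1.0, 0.0], [1.0, 0.0]]  # second token is orthogonal to the teacher
loss = distillation_loss(student_emb, teacher_emb, w_mse=1.0, w_cd=0.5)
print(loss)
```

The MSE term penalises differences in magnitude as well as direction, while the cosine-distance term depends only on direction, which is one plausible reason to weight the two separately.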