Grid: Omni Visual Generation

Cong Wan, Xiangyang Luo, Hao Luo, Zijian Cai, Yiren Song, Yunlong Zhao, Yifan Bai, Fan Wang, Yuhang He, Yihong Gong

TL;DR

GRID reformulates temporal visual sequences as grid layouts, so that an existing image diffusion model can generate all frames of a sequence holistically and in parallel. By coupling grid representations with parallel flow matching and a coarse-to-fine training schedule, GRID attains strong temporal coherence and spatial consistency across Text-to-Video, Image-to-Video, and multi-view tasks, while sharply reducing training data and computational demands. Inference is up to 67× faster than with specialized video models at competitive quality, and the approach extends to video restoration, motion cloning, and 3D editing without task-specific architectures. Its zero-shot generalization offers a practical, scalable path toward unified visual sequence synthesis that preserves image-generation strengths while expanding into dynamic, multi-view content.

Abstract

Visual generation has witnessed remarkable progress in single-image tasks, yet extending these capabilities to temporal sequences remains challenging. Current approaches either build specialized video models from scratch at enormous computational cost or add separate motion modules to image generators; both require learning temporal dynamics anew. We observe that modern image generation models possess underutilized potential in handling structured layouts with implicit temporal understanding. Building on this insight, we introduce GRID, which reformulates temporal sequences as grid layouts, enabling holistic processing of visual sequences while leveraging existing model capabilities. Through a parallel flow-matching training strategy with coarse-to-fine scheduling, our approach achieves up to 67× faster inference while using <1/1000 of the computational resources of specialized models. Extensive experiments demonstrate that GRID not only excels in temporal tasks from Text-to-Video to 3D Editing but also preserves strong performance in image generation, establishing itself as an efficient and versatile omni-solution for visual generation.
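
To make the grid reformulation concrete, the following is a minimal sketch of tiling a frame sequence into a single grid image and recovering the frames afterwards. It is an illustration only, not the authors' implementation: the function names `frames_to_grid` and `grid_to_frames`, the row-major frame ordering, and the nested-list frame representation are all assumptions made here for clarity.

```python
def frames_to_grid(frames, rows, cols):
    """Tile T = rows * cols frames into one grid image (row-major order).

    Each frame is a list of H pixel rows, each a list of W values.
    Hypothetical helper; the ordering of frames in the grid is an assumption.
    """
    if len(frames) != rows * cols:
        raise ValueError("number of frames must equal rows * cols")
    height = len(frames[0])
    grid = []
    for r in range(rows):
        row_frames = frames[r * cols:(r + 1) * cols]
        for y in range(height):
            line = []
            for frame in row_frames:
                line.extend(frame[y])  # concatenate pixel row y of each frame
            grid.append(line)
    return grid


def grid_to_frames(grid, rows, cols):
    """Inverse of frames_to_grid: cut the grid image back into frames."""
    height = len(grid) // rows
    width = len(grid[0]) // cols
    frames = []
    for r in range(rows):
        for c in range(cols):
            frames.append([
                grid[r * height + y][c * width:(c + 1) * width]
                for y in range(height)
            ])
    return frames
```

Under these assumptions, an image model that emits one grid image has produced every frame at once, and `grid_to_frames` recovers the sequence. In practice the same reshaping would be done on image tensors with an array library.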

Paper Structure

This paper contains 41 sections, 13 equations, 14 figures, 8 tables.

Figures (14)

  • Figure 1: Different paradigms for temporal visual generation. (a) Motion-Scratch (e.g., SVD, AnimateDiff): learn temporal dynamics from scratch while reusing pretrained image models. (b) Full-Scratch (e.g., Sora): learn everything from scratch, requiring massive data and computational resources. (c) Zero-Scratch (GRID): reuse both spatial and temporal capabilities through grid-based reformulation, leveraging pretrained models' inherent understanding.
  • Figure 2: Pipeline Overview. Left: GRID arranges videos into grid layouts, with text annotations combining a layout-format prefix and LLM-generated captions. The model is trained using LoRA fine-tuning on DiT blocks, incorporating both a base loss and a temporal loss to capture inter-frame relationships. Right: Grid-based reformulation naturally extends the model's built-in self-attention to include frame-wise self-attention, cross-frame attention, and text-to-frames cross-attention.
  • Figure 3: Omni Inference Framework: By transforming temporal and view sequences into structured layout spaces, we enable a purely image-based model, FLUX, to tackle diverse video and multi-view tasks (text/image-to-video generation, video interpolation, and multi-view synthesis) through a unified pipeline without additional video-specific architectures.
  • Figure 4: Zero-shot evaluation of foundation models on grid-based multi-view generation tasks before we begin to train. Using the prompt "a * from different angles in a mxn grid layout,"
  • Figure 5: Comparison of attention mechanisms. (a) Traditional video diffusion models rely on three separate attention modules to handle spatial understanding, semantic guidance, and temporal consistency respectively. (b) Through our grid layout reformulation, FLUX's unified self-attention naturally encompasses both inner-frame ($I_i,I_i$) and cross-frame ($I_i,I_j$) relationships, while its global text-image attention ($T,I$) enables consistent control across all frames. This simplification eliminates the need for specialized temporal modules while maintaining effective spatio-temporal understanding.
  • ...and 9 more figures
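
The attention decomposition described in the Figure 5 caption can be illustrated with a small sketch: once frames are tiled into a grid, every pair of image tokens in a single self-attention pass is either an inner-frame pair ($I_i,I_i$) or a cross-frame pair ($I_i,I_j$). The code below only labels and counts these pairs. It assumes the grid is split into patch tokens flattened in row-major order; the helper names are hypothetical, and FLUX's actual tokenization and positional encoding are not modeled.

```python
def token_frame_ids(rows, cols, patch_h, patch_w):
    """Return the frame index of each token in a row-major flattened grid.

    The grid has rows x cols frames; each frame contributes
    patch_h x patch_w tokens (assumed layout, for illustration only).
    """
    ids = []
    for y in range(rows * patch_h):
        for x in range(cols * patch_w):
            ids.append((y // patch_h) * cols + (x // patch_w))
    return ids


def attention_pair_counts(frame_ids):
    """Count inner-frame and cross-frame (query, key) token pairs."""
    total = len(frame_ids) ** 2
    inner = sum(1 for q in frame_ids for k in frame_ids if q == k)
    return {"inner_frame": inner, "cross_frame": total - inner}


# Example: a 2 x 2 grid of frames, each split into 2 x 2 patch tokens.
counts = attention_pair_counts(token_frame_ids(2, 2, 2, 2))
```

For this 2 x 2 example there are 16 tokens and 256 query-key pairs, of which 64 fall within a frame and 192 span two frames, so most of the pairs that one full self-attention pass scores are cross-frame pairs.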