Generative AI Beyond LLMs: System Implications of Multi-Modal Generation

Alicia Golden, Samuel Hsia, Fei Sun, Bilge Acun, Basil Hosmer, Yejin Lee, Zachary DeVito, Jeff Johnson, Gu-Yeon Wei, David Brooks, Carole-Jean Wu

TL;DR

The paper investigates system implications of multi-modal generative AI beyond LLMs by comparing diffusion- and transformer-based TTI/TTV models and profiling eight representative workloads. It demonstrates that, even with state-of-the-art optimizations such as Flash Attention, bottlenecks shift (Convolution dominates diffusion-based TTI, Linear/Attention dominate transformer-based TTI), and that diffusion inference exhibits highly variable sequence length while TTVs suffer from temporal attention bottlenecks. It also introduces a practical mapping of LLM concepts to TTI/TTV workloads (Prefill/Decode) and reveals memory scaling of $O(L^4)$ with image/latent dimension, plus $O(n^2)$ scaling for temporal frames, underscoring distinct system challenges. The study offers a critical first step toward designing efficient, deployable systems for emerging TTI/TTV workloads and highlights opportunities for workload-specific optimizations.
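
The scaling claims above can be illustrated with a minimal sketch. It assumes one attention token per latent pixel, so an $L \times L$ latent gives a sequence length of $L^2$ and a similarity matrix with $L^4$ entries. It also assumes temporal attention attends across $n$ frames, giving $n^2$ entries. The function names are hypothetical and are not taken from the paper.

```python
def spatial_attention_entries(latent_dim: int) -> int:
    """Entries in one self-attention similarity matrix for an L x L latent.

    Assumption: one token per latent pixel, so sequence length is L * L.
    """
    seq_len = latent_dim * latent_dim
    # The similarity matrix is (seq_len x seq_len), i.e. L^4 entries.
    return seq_len * seq_len


def temporal_attention_entries(num_frames: int) -> int:
    """Entries in one temporal-attention similarity matrix across n frames.

    Assumption: each spatial position attends over all n frames.
    """
    return num_frames * num_frames


if __name__ == "__main__":
    for latent_dim in (32, 64, 128):
        entries = spatial_attention_entries(latent_dim)
        print(f"L={latent_dim}: {entries} similarity-matrix entries")
    for num_frames in (8, 16, 32):
        entries = temporal_attention_entries(num_frames)
        print(f"n={num_frames}: {entries} temporal-attention entries")
```

Under these assumptions, doubling the latent dimension multiplies the similarity-matrix size by 16, while doubling the frame count multiplies the temporal-attention matrix by 4.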

Abstract

As the development of large-scale Generative AI models evolves beyond text (1D) generation to include image (2D) and video (3D) generation, processing spatial and temporal information presents unique challenges to quality, performance, and efficiency. We present the first work towards understanding this new system design space for multi-modal text-to-image (TTI) and text-to-video (TTV) generation models. Current model architecture designs are bifurcated into two categories: Diffusion- and Transformer-based models. Our systematic performance characterization on a suite of eight representative TTI/TTV models shows that after state-of-the-art optimization techniques such as Flash Attention are applied, Convolution accounts for up to 44% of execution time for Diffusion-based TTI models, while Linear layers consume up to 49% of execution time for Transformer-based models. We additionally observe that Diffusion-based TTI models resemble the Prefill stage of LLM inference, and benefit from 1.1-2.5x greater speedup from Flash Attention than Transformer-based TTI models that resemble the Decode phase. Since optimizations designed for LLMs do not map directly onto TTI/TTV models, we must conduct a thorough characterization of these workloads to gain insights for new optimization opportunities. In doing so, we define sequence length in the context of TTI/TTV models and observe that sequence length can vary up to 4x in Diffusion model inference. We additionally observe that temporal aspects of TTV workloads pose unique system bottlenecks, with Temporal Attention accounting for over 60% of total Attention time. Overall, our in-depth system performance characterization is a critical first step towards designing efficient and deployable systems for emerging TTI/TTV workloads.

Paper Structure (19 sections, 6 equations, 14 figures, 2 tables)

Figures (14)

  • Figure 1: Across industry-scale datacenters, Text-to-Image (TTI) models use roughly 14x more GPUs per model parameter during training and 1.35x higher memory utilization as compared to LLMs, demonstrating their growing importance at the datacenter scale.
  • Figure 2: Common Text-to-Image Model Architectures. Models consist of multiple independently-trained components that are strung together during inference (shown here) to take text as input and generate an image output. Note that the top two models use diffusion-based architectures (green), while the bottom two models use transformer-based architectures (red).
  • Figure 3: Detail on Diffusion and Transformer models. Note that Diffusion models consist of Resnet blocks, Self-Attention blocks, and Cross-Attention blocks, while Transformer-based models consist of Self/Cross Attention and FeedForward blocks.
  • Figure 4: Pareto-Optimal curve showing tradeoff between model quality and system resources for various Text-to-Image models. Bottom left corner is optimal. Bolded points represent models further examined in the model suite outlined in Section 3. Note that a corresponding figure for TTV models shows a similar trend. Diffusion TTV models often have both the lowest number of parameters and the lowest FID score.
  • Figure 5: Text-to-Image/Video Models Roofline on A100 GPU. Diffusion models have higher arithmetic intensity than transformer-based TTI models, and fall in the compute-bound region. Transformer-based models are memory-bandwidth bound at low batch sizes.
  • ...and 9 more figures
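
The roofline reading in Figure 5 can be sketched as follows. This is a minimal illustration, not the paper's methodology. The peak-compute and memory-bandwidth constants are assumed nominal values for an A100 (about 312 TFLOP/s dense FP16 tensor-core throughput and roughly 2 TB/s of HBM bandwidth on the 80 GB part), and the function names are hypothetical.

```python
# Assumed nominal A100 figures; substitute measured values for real analysis.
A100_PEAK_FLOPS = 312e12  # FLOP/s, dense FP16 tensor-core peak (assumption)
A100_MEM_BW = 2.0e12      # bytes/s, approximate HBM bandwidth (assumption)


def ridge_point(peak_flops: float = A100_PEAK_FLOPS,
                mem_bw: float = A100_MEM_BW) -> float:
    """Arithmetic intensity (FLOP/byte) where the two roofline limits meet."""
    return peak_flops / mem_bw


def attainable_flops(arithmetic_intensity: float,
                     peak_flops: float = A100_PEAK_FLOPS,
                     mem_bw: float = A100_MEM_BW) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(peak_flops, arithmetic_intensity * mem_bw)


def is_compute_bound(arithmetic_intensity: float,
                     peak_flops: float = A100_PEAK_FLOPS,
                     mem_bw: float = A100_MEM_BW) -> bool:
    """True when the kernel sits at or to the right of the ridge point."""
    return arithmetic_intensity >= ridge_point(peak_flops, mem_bw)


if __name__ == "__main__":
    # Illustrative intensities only; these are not values from the paper.
    for ai in (10.0, 156.0, 500.0):
        bound = "compute" if is_compute_bound(ai) else "memory-bandwidth"
        print(f"AI={ai} FLOP/byte -> {attainable_flops(ai):.3e} FLOP/s ({bound}-bound)")
```

With these assumed constants, a workload whose arithmetic intensity falls below the ridge point is limited by memory bandwidth, which is the regime the caption describes for transformer-based models at low batch sizes.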