PoM: Efficient Image and Video Generation with the Polynomial Mixer

David Picard, Nicolas Dufour

TL;DR

The authors show that the Polynomial Mixer (PoM) is a universal sequence-to-sequence approximator, just like regular Multi-Head Attention (MHA). They adapt several Diffusion Transformers to generate images and videos with PoM replacing MHA, and obtain high-quality samples while using fewer computational resources.

Abstract

Diffusion models based on Multi-Head Attention (MHA) have become ubiquitous for generating high-quality images and videos. However, encoding an image or a video as a sequence of patches results in costly attention patterns, as both memory and compute requirements grow quadratically with the number of patches. To alleviate this problem, we propose a drop-in replacement for MHA called the Polynomial Mixer (PoM) that has the benefit of encoding the entire sequence into an explicit state. PoM has linear complexity with respect to the number of tokens. This explicit state also allows us to generate frames in a sequential fashion, minimizing memory and compute requirements, while still being able to train in parallel. We show that the Polynomial Mixer is a universal sequence-to-sequence approximator, just like regular MHA. We adapt several Diffusion Transformers (DiT) for generating images and videos with PoM replacing MHA, and we obtain high-quality samples while using fewer computational resources. The code is available at https://github.com/davidpicard/HoMM.
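
The claim that an explicit state gives linear cost and permits sequential generation can be illustrated with a small sketch. It assumes the state is a plain average of per-token polynomial features; the names `expand`, `batch_state`, `streaming_state` and the use of stacked element-wise powers are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

def expand(x, W_h, degree=2):
    """Hypothetical polynomial expansion: stack element-wise powers of W_h @ x."""
    z = W_h @ x
    return np.concatenate([z ** p for p in range(1, degree + 1)], axis=0)

def batch_state(X, W_h, degree=2):
    """State computed from all n tokens at once; work grows linearly with n."""
    return expand(X, W_h, degree).mean(axis=1)

def streaming_state(X, W_h, degree=2):
    """Same state built one token at a time with a running mean.

    Memory stays fixed at the size of the state, whatever the sequence length.
    """
    state = np.zeros(degree * W_h.shape[0])
    for t in range(X.shape[1]):
        state += (expand(X[:, t], W_h, degree) - state) / (t + 1)
    return state

rng = np.random.default_rng(0)
W_h = rng.normal(size=(8, 4))   # assumed projection: 4 input dims -> 8 features
X = rng.normal(size=(4, 16))    # 16 tokens, one per column
print(np.allclose(batch_state(X, W_h), streaming_state(X, W_h)))
```

Under these assumptions both functions agree up to floating-point error, which is the property that lets such a layer be trained in parallel and evaluated sequentially.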

Paper Structure

This paper contains 24 sections, 3 theorems, 19 equations, 14 figures, 3 tables.

Key Result

Proposition 1

A Polynomial Mixer is permutation equivariant, i.e., let $X \in \mathbb{R}^{d\times n}$ be a set of vectors and $P$ a column permutation matrix, then $\operatorname{PoM}(XP) = \operatorname{PoM}(X)P$.
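
The proposition can be checked numerically on a toy layer. The sketch below is an assumed, simplified form of a polynomial mixer (element-wise powers of a linear projection, averaged over tokens, then gated per token with a sigmoid); `pom_sketch` and the weight names `W_h`, `W_g`, `W_o` are hypothetical and may differ from the paper's exact definition.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pom_sketch(X, W_h, W_g, W_o, degree=2):
    """Toy polynomial-mixer-style layer (assumed form, not the authors' code).

    X has shape (d, n): one column per token.
    """
    Z = W_h @ X                                           # (D, n) per-token projection
    poly = np.concatenate([Z ** p for p in range(1, degree + 1)], axis=0)
    state = poly.mean(axis=1, keepdims=True)              # (degree*D, 1), shared by all tokens
    gate = sigmoid(W_g @ X)                               # (degree*D, n), one gate per token
    return W_o @ (gate * state)                           # (d, n), back to the input dimension

rng = np.random.default_rng(0)
d, n, D, degree = 4, 6, 8, 2
X = rng.normal(size=(d, n))
W_h = rng.normal(size=(D, d))
W_g = rng.normal(size=(degree * D, d))
W_o = rng.normal(size=(d, degree * D))

P = np.eye(n)[:, rng.permutation(n)]                      # column permutation matrix
lhs = pom_sketch(X @ P, W_h, W_g, W_o, degree)            # PoM(XP)
rhs = pom_sketch(X, W_h, W_g, W_o, degree) @ P            # PoM(X)P
print(np.allclose(lhs, rhs))
```

In this sketch the averaged state does not depend on token order, and the gate is computed independently per column, so permuting the input columns permutes the output columns in the same way.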

Figures (14)

  • Figure 1: Comparison between the speed of PoM and Multi-Head Attention (MHA) in the same DiT-XL/2 architecture at different image resolutions. We use an H100 GPU and report the average time over 100 synthetic training batches for the forward pass and for the forward+backward pass. Synthetic data removes the influence of data loading. At higher resolutions, training with PoM is less costly than inference with MHA.
  • Figure 2: Diagram for the Polynomial Mixer. The input sequence is split into two paths. The top path expands each token using a polynomial before they are mixed (averaged) into a single representation. The bottom path expands the tokens into gating coefficients. Both paths are recombined and projected back into the input dimension.
  • Figure 3: Building blocks for our diffusion models using PoM. For class-conditional image generation (a), we strictly follow DiT (Peebles and Xie, 2023) in the AdaLN variant, replacing multi-head attention with PoM. For text-to-video generation (b), we follow a hybrid approach in which the encoded text tokens are incorporated into the video tokens using PoM instead of cross-attention, while the time is used as a modulation. Modulation means a component-wise scale and shift based on the coefficients predicted by the MLP (similarly to the AdaLN approach).
  • Figure 4: Qualitative results on class-conditional generation. We show images sampled with the model DiPoM-XL/2 trained with the flow-matching loss $\mathcal{L}_\text{FM}$ at several resolutions for different classes. We use classifier-free guidance with $\omega=4s/s_0$ with $s$ the scale of the image and $s_0$ the reference scale (256).
  • Figure 5: Scaling laws for a DiT-like architecture with attention replaced by PoM. FIDs and Inception Scores (IS) are computed on 10k samples with classifier-free guidance ($\omega=1$) and shown with a linear regression in log space. Performance scales with the computation budget, similarly to transformers.
  • ...and 9 more figures
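
The resolution-dependent guidance weight quoted in the Figure 4 caption, $\omega = 4s/s_0$ with $s_0 = 256$, can be written as a one-line helper. The function name `guidance_scale` is hypothetical, reading $s$ as the image side length in pixels is an assumption, and the resolutions in the loop are illustrative values rather than the ones used in the paper.

```python
def guidance_scale(resolution, reference=256, base=4.0):
    """Return omega = base * s / s_0 (formula from the Figure 4 caption).

    `resolution` is assumed to be the image side length in pixels.
    """
    if resolution <= 0 or reference <= 0:
        raise ValueError("resolution and reference must be positive")
    return base * resolution / reference

# Illustrative resolutions only.
for s in (256, 512, 1024):
    print(s, guidance_scale(s))
# 256 4.0
# 512 8.0
# 1024 16.0
```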

Theorems & Definitions (4)

  • Proposition 1: Permutation equivariance
  • proof
  • Theorem 2: Universal approximation
  • Lemma 3: Contextual mapping (informal)