Lateralization MLP: A Simple Brain-inspired Architecture for Diffusion

Zizhao Hu, Mohammad Rostami

TL;DR

This work introduces L-MLP, a simple brain-inspired diffusion backbone that uses dimension-permutation and two hemispheric branches to enable parallel processing before a joint MLP, aiming to supplant expensive self-attention. A U-shaped variant (UL-MLP) tailored for latent diffusion demonstrates competitive visual generation quality against Transformer backbones at reduced computational cost. Through extensive MS-COCO diffusion experiments and ablations, the study shows that carefully designed L-MLP blocks can approach Transformer performance while offering improved efficiency and stability, with learning dynamics that echo brain-inspired lateralization. The findings suggest that attention is not strictly necessary for high-quality diffusion-based generation and that brain-inspired architectural priors can yield practical gains for multimodal synthesis.

Abstract

The Transformer architecture has dominated machine learning across a wide range of tasks. Its defining characteristic is an expensive scaled dot-product attention mechanism that models inter-token interactions and is widely credited with its success. However, this mechanism has no direct parallel in the human brain, which raises the question of whether scaled dot-product attention is necessary for intelligence with strong expressive power. Inspired by the lateralization of the human brain, we propose a simple but effective architecture called the Lateralization MLP (L-MLP). L-MLP blocks can be stacked to build complex architectures. Each L-MLP block is based on a multi-layer perceptron (MLP) that permutes data dimensions, processes each dimension in parallel, merges them, and finally passes the result through a joint MLP. We find that this specific design outperforms other MLP variants and performs comparably to a Transformer-based architecture on the challenging diffusion task while being highly efficient. We conduct experiments on text-to-image generation tasks to demonstrate the effectiveness and efficiency of L-MLP. Further, we examine the model's behavior and find a connection to the function of the human brain. Our code is publicly available: https://github.com/zizhao-hu/L-MLP
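
The block structure described in the abstract (permute dimensions, process each in parallel, merge, then apply a joint MLP) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a 2-D input of shape (tokens, channels), one two-layer MLP per dimension, merging by concatenation, and no normalization or residual path; the names `init_mlp`, `init_block`, `lmlp_block`, and the `expansion` parameter are hypothetical. The repository linked above is the reference for the actual design.

```python
import numpy as np


def gelu(x):
    # Tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def init_mlp(rng, d_in, d_hidden, d_out):
    # Parameters of a two-layer perceptron with small random weights.
    return {
        "w1": rng.normal(0.0, 0.02, (d_in, d_hidden)),
        "b1": np.zeros(d_hidden),
        "w2": rng.normal(0.0, 0.02, (d_hidden, d_out)),
        "b2": np.zeros(d_out),
    }


def mlp(p, x):
    # Applies the perceptron along the last axis of x.
    return gelu(x @ p["w1"] + p["b1"]) @ p["w2"] + p["b2"]


def init_block(rng, n_tokens, n_channels, expansion=2):
    # One MLP per data dimension plus a joint MLP (hypothetical layout).
    return {
        "channel": init_mlp(rng, n_channels, expansion * n_channels, n_channels),
        "token": init_mlp(rng, n_tokens, expansion * n_tokens, n_tokens),
        "joint": init_mlp(rng, 2 * n_channels, expansion * n_channels, n_channels),
    }


def lmlp_block(params, x):
    # x has shape (n_tokens, n_channels).
    # Branch 1: mix along the channel dimension, independently per token.
    a = mlp(params["channel"], x)
    # Branch 2: permute so tokens are the last axis, mix along tokens, permute back.
    b = mlp(params["token"], x.T).T
    # Merge the two branches (concatenation is an assumption of this sketch).
    merged = np.concatenate([a, b], axis=-1)
    # Joint MLP maps the merged features back to n_channels.
    return mlp(params["joint"], merged)


rng = np.random.default_rng(0)
params = init_block(rng, n_tokens=16, n_channels=8)
x = rng.normal(size=(16, 8))
y = lmlp_block(params, x)
print(y.shape)  # (16, 8)
```

The two branches do not depend on each other, so they can run in parallel. In this sketch the cost of token mixing is fixed by the MLP widths, with no pairwise token-by-token attention matrix computed.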

Paper Structure

This paper contains 21 sections, 6 equations, 8 figures, 5 tables, 1 algorithm.

Figures (8)

  • Figure 1: L-MLP block design and latent diffusion pipeline built from L-MLP blocks.
  • Figure 2: Block design comparison with MLP-Mixer and gMLP. For all designs, see Appendix A.2.
  • Figure 3: Generated images from text prompts (from left to right, top to bottom: 'A green train is coming down the tracks.', 'A group of skiers are preparing to ski down a mountain.', 'A road with traffic lights, street lights, and cars.', 'A bus driving in a city area with traffic signs.') at 100K steps for different models. We compare our L-MLP (F2) with two MLP variants and a Transformer (U-ViT).
  • Figure 4: Generated samples from validation prompts. Both models share similar visual features. See Appendix A.4 for more comparisons.
  • Figure 5: Comparison of FID and CLIP scores with U-ViT-S/2. UL-MLP shows training behavior similar to that of its Transformer-based counterpart while converging slightly faster.
  • ...and 3 more figures