The Design Space of Tri-Modal Masked Diffusion Models

Louis Bethune, Victor Turrisi, Bruno Kacper Mlodozeniec, Pau Rodriguez Lopez, Lokesh Boominathan, Nikhil Bhendawade, Amitis Shidani, Joris Pelemans, Theo X. Olausson, Devon Hjelm, Paul Dixon, Joao Monteiro, Pierre Ablin, Vishnu Banna, Arno Blaas, Nick Henderson, Kari Noriy, Dan Busbridge, Josh Susskind, Marco Cuturi, Irina Belousova, Luca Zappella, Russ Webb, Jason Ramapuram

TL;DR

This work introduces the first tri-modal masked diffusion model pretrained from scratch on text, image-text, and audio-text data, and systematically analyzes multimodal scaling laws, modality mixing ratios, noise schedules, and batch-size effects.

Abstract

Discrete diffusion models have emerged as strong alternatives to autoregressive language models, with recent work initializing and fine-tuning a base unimodal model for bimodal generation. Diverging from previous approaches, we introduce the first tri-modal masked diffusion model pretrained from scratch on text, image-text, and audio-text data. We systematically analyze multimodal scaling laws, modality mixing ratios, noise schedules, and batch-size effects, and we provide optimized inference sampling defaults. Our batch-size analysis yields a novel stochastic differential equation (SDE)-based reparameterization that eliminates the need for tuning the optimal batch size as reported in recent work. This reparameterization decouples the physical batch size, often chosen based on compute constraints (GPU saturation, FLOP efficiency, wall-clock time), from the logical batch size, chosen to balance gradient variance during stochastic optimization. Finally, we pretrain a preliminary 3B-parameter tri-modal model on 6.4T tokens, demonstrating the capabilities of a unified design and achieving strong results in text generation, text-to-image tasks, and text-to-speech tasks. Our work represents the largest-scale systematic open study of multimodal discrete diffusion models conducted to date, providing insights into scaling behaviors across multiple modalities.

Paper Structure (64 sections, 32 equations, 29 figures, 10 tables)

Figures (29)

  • Figure 1: High-fidelity generation. The pretrain-only 3B model demonstrates strong prompt adherence alongside high-quality visual rendering of texture, lighting, and composition. Samples show: (a) natural daylight and depth of field ("egg in a field of crocuses"); (b) fine-grained fur texture in B&W ("lion's face"); (c) soft, warm lighting with vintage color tones ("preparing bread dough"); and (d) complex multi-object arrangement ("noodle soup with toppings"). Extended generations in Appendix \ref{app:extended_generations}.
  • Figure 2: Tri-Modal masked diffusion model architecture. Pure text is packed. Image-caption and audio-transcription pairs are padded to maximum length. Padding is ignored by attention and loss computation.
  • Figure 3: Token-optimal curve $D^{\star}(N)$ for different model families. In tri-modal MDM, the optimal token count grows sub-linearly with model size, suggesting diminishing returns from additional data. We use an identical methodology to report all curves.
  • Figure 4: Below the critical batch size $B_{\text{crit}}$ the SDE parametrization guarantees constant loss. In that regime, larger batch sizes allow fewer iterations. Above it, SDE discretization breaks and training ceases to be FLOP-efficient.
  • Figure 5: Critical iteration count $S_{\text{crit}}$ is constant w.r.t. model size under the SDE regime. This is compatible with the findings of DBLP:journals/corr/abs-2505-13738, but their study was done outside the SDE regime.
  • ...and 24 more figures
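
The Figure 2 caption states that padding is ignored by attention and loss computation. As a minimal sketch of the loss side only, the NumPy snippet below averages cross-entropy over positions that are both masked by the diffusion process and not padding. The function name `masked_token_loss`, the array shapes, and the unweighted mean are assumptions made for illustration; the paper's actual objective (for example, any noise-schedule weighting) is not given in this summary.

```python
import numpy as np

def masked_token_loss(logits, targets, is_masked, is_padding):
    """Mean cross-entropy over masked, non-padding positions (illustrative only).

    logits:     float array, shape (batch, seq_len, vocab)
    targets:    int array,   shape (batch, seq_len)
    is_masked:  bool array,  shape (batch, seq_len), True where the token was masked
    is_padding: bool array,  shape (batch, seq_len), True where the token is padding
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of the true token at every position.
    nll = -np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
    # Assumption: only masked, non-padding positions contribute to the loss.
    weight = (is_masked & ~is_padding).astype(nll.dtype)
    # Guard against division by zero when no position contributes.
    return (nll * weight).sum() / max(weight.sum(), 1.0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(2, 6, 11))
    targets = rng.integers(0, 11, size=(2, 6))
    is_masked = rng.random((2, 6)) < 0.5
    is_padding = np.zeros((2, 6), dtype=bool)
    is_padding[:, 4:] = True  # last two positions of each row are padding
    print(masked_token_loss(logits, targets, is_masked, is_padding))
```

With this masking, changing the logits at a padded position leaves the loss unchanged, which is the behaviour the caption describes for padded image-caption and audio-transcription pairs.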