Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis

Hila Chefer, Patrick Esser, Dominik Lorenz, Dustin Podell, Vikash Raja, Vinh Tong, Antonio Torralba, Robin Rombach

TL;DR

This work introduces Self-Flow, a self-supervised flow matching paradigm that integrates representation learning within the generative framework. It generalizes across modalities, enables multi-modal training, and follows expected scaling laws, achieving superior image, video, and audio generation.

Abstract

Strong semantic representations improve the convergence and generation quality of diffusion and flow models. Existing approaches largely rely on external models, which require separate training, operate on misaligned objectives, and exhibit unexpected scaling behavior. We argue that this dependence arises from the model's training objective, which poses a denoising task with little incentive to learn semantic representations. We introduce Self-Flow: a self-supervised flow matching paradigm that integrates representation learning within the generative framework. Our key mechanism, Dual-Timestep Scheduling, applies heterogeneous noise levels across tokens, creating an information asymmetry that forces the model to infer missing information from corrupted inputs. This drives learning strong representations alongside generative capabilities without external supervision. Our method generalizes across modalities and enables multi-modal training while following expected scaling laws, achieving superior image, video, and audio generation.

Paper Structure (29 sections, 10 equations, 34 figures, 7 tables)

Figures (34)

  • Figure 1: Results obtained by our self-supervised flow matching framework, Self-Flow. (a) On text-to-image generation, our method converges $\sim 2.8\times$ faster than REPA, the predominant external-alignment method, without using any external models or supervision. Notably, REPA plateaus while our method continues to improve. (b) Compared to vanilla flow matching, our approach improves structural coherence, text rendering, and temporal consistency.
  • Figure 2: Motivation. (a) Scaling DINO (v2-B $<$ v2-L $<$ v3-H+) paradoxically degrades generation quality when using REPA. (b) Diffusion forcing and full masking create a train-inference gap, which degrades generation quality. Our Dual-Timestep Scheduling improves generation even without a self-supervised objective.
  • Figure 3: Illustration of our method. Given a clean input $x_0$, we draw two timesteps $t, s$, and a random mask $M$, then noise each token according to its assigned timestep. The teacher input is noised with $\tau_{\min}=\min\{t,s\}$, creating an information asymmetry compared to the student. The student is trained to simultaneously denoise the input and reconstruct the teacher's features given its mixed-noise view.
  • Figure 4: Autoencoder generalization and improved representations. (a) Our method improves training and generation in RAE, demonstrating compatibility with semantic latent spaces. (b) Linear probing confirms that our method learns stronger representations than standard flow matching.
  • Figure 5: Quantitative results across modalities. Our method significantly outperforms all external and internal alignment methods across text-based image, video, and audio generation. Our method is the only one to outperform REPA on DINOv2 FD (despite REPA directly aligning with DINOv2). Arrows indicate whether lower ($\downarrow$) or higher ($\uparrow$) is better.
  • ...and 29 more figures
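
The noising step described in the Figure 3 caption can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions not stated in the text above: the linear interpolation $x_\tau = (1-\tau)\,x_0 + \tau\,\epsilon$ (with $\tau=0$ clean and $\tau=1$ pure noise), uniform sampling of $t$ and $s$, an independent per-token Bernoulli mask with a hypothetical `mask_ratio`, and the teacher sharing the student's noise sample. The function name `dual_timestep_noise` and its return keys are likewise hypothetical.

```python
import random


def interpolate(token, eps, tau):
    """Linear noising path (assumed): (1 - tau) * x0 + tau * eps."""
    return [(1.0 - tau) * x + tau * e for x, e in zip(token, eps)]


def dual_timestep_noise(x0, mask_ratio=0.5, rng=None):
    """Build student and teacher inputs from clean tokens x0.

    x0 is a list of tokens, each token a list of floats.
    mask_ratio is a hypothetical parameter: the probability that a
    token is assigned timestep s instead of t.
    """
    rng = rng or random.Random()

    # Draw two timesteps (uniform sampling is an assumption).
    t, s = rng.random(), rng.random()

    # Random mask M: True means the token is noised at s, False at t.
    mask = [rng.random() < mask_ratio for _ in x0]
    token_times = [s if m else t for m in mask]

    # The teacher sees every token at the lower noise level.
    tau_min = min(t, s)

    # One Gaussian noise sample per token, shared by student and
    # teacher (sharing the sample is an assumption).
    noise = [[rng.gauss(0.0, 1.0) for _ in tok] for tok in x0]

    student = [interpolate(tok, eps, tau)
               for tok, eps, tau in zip(x0, noise, token_times)]
    teacher = [interpolate(tok, eps, tau_min)
               for tok, eps in zip(x0, noise)]

    return {
        "t": t, "s": s, "tau_min": tau_min,
        "mask": mask, "token_times": token_times,
        "noise": noise, "student": student, "teacher": teacher,
    }


if __name__ == "__main__":
    clean = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    out = dual_timestep_noise(clean, rng=random.Random(0))
    print("token timesteps:", out["token_times"])
    print("teacher timestep:", out["tau_min"])
```

Under these assumptions, tokens assigned the smaller timestep are identical in the student and teacher views, while the remaining student tokens are noisier. That gap is the information asymmetry the caption refers to. The denoising and feature-reconstruction losses are omitted here because the text above does not specify them.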