Disentangled Representation Learning via Flow Matching

Jinjin Chi; Taoping Liu; Mengtao Yin; Ximing Li; Yongcheng Jing; Dacheng Tao

TL;DR

This work introduces a flow-matching framework for disentangled representation learning by casting disentanglement as learning factor-conditioned flows in a latent space. It decomposes the latent transport velocity into factor-specific components and enforces semantic alignment through an orthogonality regularizer implemented via an output-attention mechanism, enabling non-overlapping, factorwise transformations. Empirical results across Cars3D, Shapes3D, MPI3D-toy, and CelebA show substantial improvements in disentanglement metrics (e.g., FactorVAE score and DCI), better controllability, and competitive sample fidelity compared to VAE-, GAN-, and diffusion-based baselines. The approach provides a deterministic, geometry-driven alternative to stochastic diffusion, with practical benefits in downstream task efficiency and semantic editing. Overall, the method advances disentangled representation learning by aligning factor semantics with latent transport dynamics, achieving reliable factor-level control.
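The decomposition and the non-overlap idea described above can be illustrated with a small NumPy sketch. This summary does not give the paper's exact regularizer or attention module, so everything here is an assumption made for illustration: the function names, the `(K, D)` layout of per-factor weights, the softmax over factors, and the choice of penalizing off-diagonal entries of the weight Gram matrix.

```python
import numpy as np


def softmax_over_factors(logits):
    """Normalize (K, D) logits so each latent dimension's weights sum to 1 over the K factors."""
    shifted = logits - logits.max(axis=0, keepdims=True)  # subtract max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)


def factorized_velocity(components, weights):
    """Combine K factor-specific velocity components (K, D) into one latent velocity (D,)."""
    return (weights * components).sum(axis=0)


def non_overlap_penalty(weights):
    """Sum of squared off-diagonal entries of W W^T (assumed form of the regularizer).

    For non-negative weights this is zero exactly when no two factors
    put weight on the same latent dimension.
    """
    gram = weights @ weights.T
    off_diag = gram - np.diag(np.diag(gram))
    return float((off_diag ** 2).sum())


# Hypothetical toy values: K = 2 factors, D = 4 latent dimensions.
components = np.array([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]])
disjoint = np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])
shared = softmax_over_factors(np.zeros((2, 4)))  # every weight equals 0.5
```

With `disjoint` weights each factor moves its own pair of dimensions and the penalty vanishes; with `shared` weights both factors act on every dimension and the penalty is positive, which is the kind of cross-factor interference the regularizer is meant to suppress.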

Abstract

Disentangled representation learning aims to capture the underlying explanatory factors of observed data, enabling a principled understanding of the data-generating process. Recent advances in generative modeling have introduced new paradigms for learning such representations. Existing diffusion-based methods encourage factor independence through inductive biases, but they often lack strong semantic alignment between latent units and the factors they are meant to represent. In this work, we propose a flow matching-based framework for disentangled representation learning, which casts disentanglement as learning factor-conditioned flows in a compact latent space. To enforce explicit semantic alignment, we introduce a non-overlap (orthogonality) regularizer that suppresses cross-factor interference and reduces information leakage between factors. Extensive experiments across multiple datasets demonstrate consistent improvements over representative baselines, yielding higher disentanglement scores as well as improved controllability and sample fidelity.
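For readers unfamiliar with flow matching, the following is a minimal sketch of a conditional flow matching loss with linear interpolation, in the standard form where the path is `x_t = (1 - t) * x0 + t * x1` and the regression target is `x1 - x0`. The function name `cfm_loss`, the `velocity_fn(x_t, t, cond)` signature, and the convention that `x0` is the noise sample and `x1` the data latent are assumptions; the paper's own parameterization may differ.

```python
import numpy as np


def cfm_loss(velocity_fn, x0, x1, t, cond):
    """Mean squared error between a predicted velocity and the straight-line target.

    x0, x1: (B, D) arrays, start and end points of each path (assumed noise and data latents).
    t:      (B,) array of times in [0, 1].
    cond:   conditioning passed through to velocity_fn (e.g. factor representations).
    """
    t = t[:, None]                      # broadcast time over the latent dimension
    x_t = (1.0 - t) * x0 + t * x1       # point on the linear interpolation path
    target = x1 - x0                    # constant velocity along that path
    pred = velocity_fn(x_t, t, cond)
    return float(np.mean(np.sum((pred - target) ** 2, axis=1)))


# Hypothetical toy batch to exercise the loss.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 4))
x1 = rng.standard_normal((8, 4))
t = rng.uniform(size=8)


def oracle(x_t, t, cond):
    """Returns the conditioning unchanged; used to check the zero-loss case."""
    return cond


def zero_velocity(x_t, t, cond):
    """Predicts no motion at all."""
    return np.zeros_like(x_t)
```

Passing `x1 - x0` as `cond` to `oracle` gives a loss of zero, while `zero_velocity` gives the mean squared path length. In a real model `velocity_fn` would be a neural network trained by gradient descent on this quantity.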

Paper Structure (38 sections, 20 equations, 4 figures, 7 tables, 1 algorithm)

Introduction
Related Work
VAE-based methods.
GAN-based methods.
Diffusion-based methods.
Background
Probability flow ODE.
Conditional Flow Matching (CFM).
Linear interpolation.
Our Method
Overview.
Flow Matching Objective
Why factorization is needed.
Factorized Velocity via Output Attention
Factorized velocity and regularization.
...and 23 more sections

Figures (4)

Figure 1: Illustration of the MPI3D-toy dataset (Gondal et al., 2019). The seven rectangles represent the underlying factors of variation in the scene, including object color, shape, size, camera height, background color, vertical axis and horizontal axis. Disentangled representation learning aims to encode these distinct generative factors into independent latent variables within the learned representation space.
Figure 2: Illustration of our proposed framework. Given an input image $I$, an image encoder $T_{\gamma}$ extracts a set of factor representations, which serve as conditional inputs to the flow-matching model via cross-attention. The right panel illustrates the Factorized Velocity via Output Attention module, which decomposes the predicted velocity field into factor-specific components and enforces non-overlapping, semantically aligned latent dynamics.
Figure 3: Factor swapping results. Conditional generation results obtained by intervening on a single latent unit. For each pair of images, we encode a source and a target, replace one latent unit in the source code with the corresponding unit from the target, and generate from the modified representation. The first two rows show the source and target images, respectively; rows three to six show the source image with only the swapped attribute (e.g., Wall hue, Object shape) transferred from the target. Left: Shapes3D. Right: MPI3D-toy.
Figure 4: Training loss curves. Left: Shapes3D. Right: MPI3D-toy.
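The intervention described in the Figure 3 caption (replace one latent unit of the source code with the corresponding unit of the target, then generate) can be sketched as below. The `(K, d)` layout with one row per factor and the name `swap_factor` are assumptions for illustration; the encoder and the flow-matching generator are not shown.

```python
import numpy as np


def swap_factor(source_code, target_code, k):
    """Return a copy of source_code whose k-th factor unit is taken from target_code.

    source_code, target_code: (K, d) arrays, one row per factor (assumed layout).
    """
    swapped = source_code.copy()   # leave the original source code untouched
    swapped[k] = target_code[k]
    return swapped


# Hypothetical codes with K = 3 factors of dimension d = 2.
source = np.zeros((3, 2))
target = np.ones((3, 2))
edited = swap_factor(source, target, k=1)
```

Only row 1 of `edited` comes from the target. If the representation is well disentangled, generating from `edited` should change a single attribute of the source image and leave the rest intact.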