Graph-based Unsupervised Disentangled Representation Learning via Multimodal Large Language Models

Baao Xie, Qiuyu Chen, Yunnan Wang, Zequn Zhang, Xin Jin, Wenjun Zeng

TL;DR

This work targets unsupervised disentangled representation learning in data with correlated factors by introducing GEM, which couples a $\beta$-VAE-based attribute extractor with a multimodal large language model (MLLM) that discovers and ranks interrelations among attributes. The two branches feed a bidirectional weighted DisGraph whose edges carry impact scores and are refined by a graph neural network, enabling fine-grained, relation-aware disentanglement and improved reconstruction. Key contributions include the first use of MLLMs for interrelation discovery in DRL, a self-driven graph framework that handles bidirectional relations with weights, and demonstrated interpretability and generalization benefits from leveraging MLLMs. Empirically, GEM outperforms conventional DRL baselines on CelebA and LSUN in reconstruction quality and yields meaningful, weighted attribute relations, suggesting practical applicability to complex real-world data and broader domains. The integration with powerful generative and reasoning models points to scalable, interpretable DRL driven by structured, commonsense knowledge.

Abstract

Disentangled representation learning (DRL) aims to identify and decompose underlying factors behind observations, thus facilitating data perception and generation. However, current DRL approaches often rely on the unrealistic assumption that semantic factors are statistically independent. In reality, these factors may exhibit correlations, which off-the-shelf solutions have yet to properly address. To tackle this challenge, we introduce a bidirectional weighted graph-based framework to learn factorized attributes and their interrelations within complex data. Specifically, we propose a $\beta$-VAE-based module to extract factors as the initial nodes of the graph, and leverage a multimodal large language model (MLLM) to discover and rank latent correlations, thereby updating the weighted edges. By integrating these complementary modules, our model successfully achieves fine-grained, practical and unsupervised disentanglement. Experiments demonstrate our method's superior performance in disentanglement and reconstruction. Furthermore, the model inherits enhanced interpretability and generalizability from MLLMs.
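
For orientation, the extractor described above builds on the standard $\beta$-VAE objective, which can be written as follows. This is the textbook form only; the symbols ($\theta$, $\phi$, $z$, $p(z)$) are the conventional ones rather than notation taken from this page, and the loss actually used in GEM may include additional terms.

$$
\mathcal{L}_{\beta\text{-VAE}}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \beta \, D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p(z)\right)
$$

The objective is maximized. Setting $\beta = 1$ recovers the ordinary VAE evidence lower bound, while $\beta > 1$ puts more pressure on the posterior to match the factorized prior, which is what encourages each latent dimension to capture a separate factor.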

Paper Structure

This paper contains 16 sections, 9 equations, 7 figures, and 2 tables.

Figures (7)

  • Figure 1: Comparison of typical DRL frameworks with our GEM. The limitations of conventional DRL methods are presented on the left. Conversely, the right-hand side illustrates the advantages of our framework, which benefits from the integration of the $\beta$-VAE and MLLMs.
  • Figure 2: Pipeline of our GEM. The model consists of two complementary branches: a $\beta$-VAE branch (blue) and an MLLM branch (brown). The former uses a $\beta$-VAE-based semantic encoder $E_{sem}$ to disentangle underlying factors, while the latter employs prompt engineering to discover and rank interrelations. The bidirectional weighted DisGraph $G$ is further proposed to embed relation-aware representations, with its parameters continually optimized by a GNN $E_{gnn}$.
  • Figure 3: A simplified example of the template for prompting MLLMs to evaluate attributes. Specifically, <text> is the interactive token, while <BOS> and <EOS> are tokens denoting the start and end of the input to MLLMs, respectively.
  • Figure 4: Qualitative comparisons between GEM and typical DRL methods. Each row of facial images corresponds to the traversal results on a specific attribute, as indicated adjacent to the images (i.e., Bangs, Bald, Gender, Beard, Blond, and Makeup). GEM exhibits superior ability in fine-grained disentanglement with discovered practical and bidirectional relations (illustrated by the heatmap).
  • Figure 5: Relation-aware disentanglement results on LSUN and the attributes beyond CelebA. Paired fine-grained attributes with inconsistent bidirectional relations are chosen to indicate effectiveness.
  • ...and 2 more figures
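
To make the idea of a bidirectional weighted graph from the Figure 2 caption more concrete, the following is a minimal sketch, not the paper's $E_{gnn}$. It assumes each attribute is a node with a small feature vector and that `weights[i][j]` holds an impact score of attribute `i` on attribute `j`; the function name `propagate`, the score convention, and the example values are all hypothetical. A single untrained aggregation step stands in for the learned GNN update.

```python
from typing import List


def propagate(nodes: List[List[float]], weights: List[List[float]]) -> List[List[float]]:
    """One round of weighted message passing on a directed graph.

    nodes[i]      -- feature vector of attribute i
    weights[i][j] -- assumed impact score of attribute i on attribute j;
                     weights[i][j] need not equal weights[j][i]
    """
    n = len(nodes)
    dim = len(nodes[0])
    updated = []
    for j in range(n):
        # Sum the features of every other node, scaled by its impact on node j.
        incoming = [0.0] * dim
        for i in range(n):
            if i == j:
                continue
            for d in range(dim):
                incoming[d] += weights[i][j] * nodes[i][d]
        # Residual update: keep the node's own feature and add the message.
        updated.append([nodes[j][d] + incoming[d] for d in range(dim)])
    return updated


# Two illustrative attributes with one-hot features (values are made up).
attributes = ["bald", "gender"]
nodes = [[1.0, 0.0], [0.0, 1.0]]
# Asymmetric scores: "gender" affects "bald" (0.8) more than the reverse (0.2).
weights = [[0.0, 0.2], [0.8, 0.0]]

for name, feature in zip(attributes, propagate(nodes, weights)):
    print(name, feature)
```

Because the two directions carry different scores, the "bald" node absorbs more of the "gender" feature than the other way round, which is the kind of asymmetry an undirected or unweighted graph could not express.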