Rethinking Multi-view Representation Learning via Distilled Disentangling

Guanzhou Ke, Bo Wang, Xiaoli Wang, Shengfeng He

TL;DR

This paper targets the redundancy between view-consistent and view-specific information in multi-view representation learning. The authors introduce MRDD, a two-stage framework: masked cross-view prediction (MCP) learns compact view-consistent representations with a single encoder, and a distilled disentangling (DD) module purifies the view-specific representations, guided by standard Gaussian priors such as $p(\bold{c}) = \mathcal{N}(\bold{0}, \bold{I})$ and CLUB-based mutual information bounds. They show that high MCP mask ratios improve the quality of the consistent component, and that giving the consistent representation fewer dimensions than the view-specific components yields better joint representations, with state-of-the-art performance reported on five multi-view datasets. The work offers practical guidance for disentangled multi-view learning and highlights how the masking strategy and the relative dimensionality of the representations affect cross-view learning.
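
For reference, the two ingredients named above have well-known closed forms. The block below is a sketch of those standard forms, not the paper's exact loss: the diagonal-Gaussian posterior with parameters $\boldsymbol{\mu}, \boldsymbol{\sigma}$ and the symbol $\bold{s}$ for a view-specific representation are notational assumptions made here for illustration.

$$
D_{\mathrm{KL}}\big(\mathcal{N}(\boldsymbol{\mu}, \mathrm{diag}(\boldsymbol{\sigma}^2)) \,\|\, \mathcal{N}(\bold{0}, \bold{I})\big) = \frac{1}{2} \sum_{j} \big(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\big)
$$

$$
I(\bold{c}; \bold{s}) \le I_{\mathrm{CLUB}}(\bold{c}; \bold{s}) = \mathbb{E}_{p(\bold{c}, \bold{s})}\big[\log p(\bold{s} \mid \bold{c})\big] - \mathbb{E}_{p(\bold{c})}\mathbb{E}_{p(\bold{s})}\big[\log p(\bold{s} \mid \bold{c})\big]
$$

The first expression pulls an encoder's output distribution toward the standard Gaussian prior. The second is the CLUB upper bound on mutual information, so minimizing it reduces the information shared between $\bold{c}$ and $\bold{s}$. In practice the conditional $p(\bold{s} \mid \bold{c})$ is usually replaced by a learned approximation.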

Abstract

Multi-view representation learning aims to derive robust representations that are both view-consistent and view-specific from diverse data sources. This paper presents an in-depth analysis of existing approaches in this domain, highlighting a commonly overlooked aspect: the redundancy between view-consistent and view-specific representations. To this end, we propose an innovative framework for multi-view representation learning, which incorporates a technique we term 'distilled disentangling'. Our method introduces the concept of masked cross-view prediction, enabling the extraction of compact, high-quality view-consistent representations from various sources without incurring extra computational overhead. Additionally, we develop a distilled disentangling module that efficiently filters out consistency-related information from multi-view representations, resulting in purer view-specific representations. This approach significantly reduces redundancy between view-consistent and view-specific representations, enhancing the overall efficiency of the learning process. Our empirical evaluations reveal that higher mask ratios substantially improve the quality of view-consistent representations. Moreover, we find that reducing the dimensionality of view-consistent representations relative to that of view-specific representations further refines the quality of the combined representations. Our code is accessible at: https://github.com/Guanzhou-Ke/MRDD.
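
To make the masking idea concrete, here is a minimal NumPy sketch of the input side of masked cross-view prediction: each view has a fixed fraction of its features removed, and a reconstruction error is scored on the removed entries only. The function names `mask_views` and `masked_mse`, the zero-fill convention, the default ratio of 0.8, and the choice to score only removed entries are all assumptions for illustration; the encoder and decoders are omitted, and the authors' repository should be consulted for the actual implementation.

```python
import numpy as np


def mask_views(views, mask_ratio=0.8, rng=None):
    """Zero out a random `mask_ratio` fraction of features in every sample of each view.

    views: list of arrays, one per view, each of shape (n_samples, n_features).
    Returns (masked_views, masks), where masks are boolean and True marks a removed entry.
    """
    rng = np.random.default_rng() if rng is None else rng
    masked, masks = [], []
    for x in views:
        n, d = x.shape
        n_drop = int(round(mask_ratio * d))
        # Independent random feature ordering per sample; the first n_drop positions are removed.
        order = np.argsort(rng.random((n, d)), axis=1)
        m = np.zeros((n, d), dtype=bool)
        np.put_along_axis(m, order[:, :n_drop], True, axis=1)
        masked.append(np.where(m, 0.0, x))
        masks.append(m)
    return masked, masks


def masked_mse(pred, target, mask):
    """Mean squared error over the removed entries only (assumes at least one is removed)."""
    return float(((pred - target) ** 2)[mask].mean())


# Toy usage: two views with 10 and 20 features, 80% of each sample's features removed.
rng = np.random.default_rng(0)
views = [rng.normal(size=(4, 10)), rng.normal(size=(4, 20))]
masked, masks = mask_views(views, mask_ratio=0.8, rng=rng)
```

In a full pipeline, the masked views would be passed to the shared consistent encoder, and each view's decoder output would be compared against the unmasked input with something like `masked_mse`.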

Paper Structure

This paper contains 24 sections, 10 equations, 10 figures, 6 tables, 1 algorithm.

Figures (10)

  • Figure 1: Existing multi-view representation learning methods show high inter-view correlations. We estimate the mutual information between multi-view consistency and specificity for three baseline MvRL models (DVIB [bao2021disentangled], CONAN [ke2021conan], and Multi-VAE [xu2021multi]) and for our method, using MINE [belghazi2018mutual] under the same settings across five datasets.
  • Figure 2: Illustration of the workflow of the proposed framework. The objective of stage I is to exploit the masked cross-view prediction strategy to uncover view-consistent representations. First, a consistent encoder learns consistent representations from all masked data. Then, several decoders predict the removed content in the corresponding views. The objective of stage II is to obtain high-quality view-specific representations by filtering out consistency-related information from the specific representations. We assume the standard Gaussian distribution as the prior for all representations.
  • Figure 3: Mask ratio. Classification accuracy (%) on five datasets for mask ratios ranging from $0\%$ to $90\%$.
  • Figure 4: Clustering results (%) for different dimensions of the consistent and specific representations on the E-MNIST and E-FMNIST datasets. The x-axis represents the consistency dimension, the y-axis represents the specificity dimension, and the z-axis represents the clustering accuracy.
  • Figure 5: Visualization of the representations of MRDD-$\bold{c}$ and MRDD-$\bold{cs}$ using t-SNE [van2008visualizing] on the E-MNIST and E-FMNIST datasets.
  • ...and 5 more figures
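
The stage II idea in the Figure 2 caption, filtering consistency-related information out of the specific representations, can be illustrated with a sampled CLUB estimate, the kind of mutual information bound the TL;DR mentions. The sketch below is a hedged illustration rather than the authors' code: the function name `club_estimate` is hypothetical, and the Gaussian conditional $q(y \mid x)$ is passed in directly as `mu` and `logvar`, whereas in practice those would come from a small learned network. The estimate is a true upper bound only when $q$ matches the real conditional.

```python
import numpy as np


def club_estimate(mu, logvar, y):
    """Sampled CLUB estimate of an upper bound on I(x; y).

    mu, logvar: mean and log-variance of a Gaussian q(y | x_i), each of shape (n, d).
    y: samples paired row-by-row with the x_i, shape (n, d).
    """
    var = np.exp(logvar)
    # log q(y_i | x_i) for matched pairs; terms that cancel against the negative part are dropped.
    positive = -((mu - y) ** 2) / (2.0 * var)
    # Average of log q(y_j | x_i) over all j, i.e. x_i paired with every y in the batch.
    diff = y[None, :, :] - mu[:, None, :]  # shape (n, n, d), entry [i, j] = y_j - mu_i
    negative = -(diff ** 2).mean(axis=1) / (2.0 * var)
    return float((positive.sum(axis=1) - negative.sum(axis=1)).mean())


# Toy usage: a conditional that ignores x gives an estimate of zero (up to rounding),
# while a conditional that predicts y exactly gives a positive estimate.
rng = np.random.default_rng(0)
y = rng.normal(size=(64, 8))
uninformative = club_estimate(np.zeros_like(y), np.zeros_like(y), y)
informative = club_estimate(y.copy(), np.zeros_like(y), y)
```

Used as a training penalty, a term like this would be minimized with respect to the specific encoder so that the specific representation carries less information about the consistent one.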