Rethinking Multi-view Representation Learning via Distilled Disentangling
Guanzhou Ke, Bo Wang, Xiaoli Wang, Shengfeng He
TL;DR
This paper targets the redundancy between view-consistent and view-specific information in multi-view representation learning. The authors introduce MRDD, a two-stage framework: masked cross-view prediction (MCP) learns a compact view-consistent representation with a single encoder, and a distilled disentangling (DD) module purifies the view-specific representations, guided by the prior $p(\mathbf{c}) \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and CLUB-based mutual information bounds. They show that high MCP mask ratios improve the quality of the consistent component and that giving the consistent representation fewer dimensions than the view-specific components yields better joint representations, achieving state-of-the-art performance on five multi-view datasets. The work offers practical guidance for disentangled multi-view learning and highlights the role of masking strategy and representation dimensionality in efficient and effective cross-view learning.
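For readers unfamiliar with CLUB (contrastive log-ratio upper bound), its general form is sketched below. Applying it to the consistent representation $\mathbf{c}$ and a view-specific representation is an assumption about how the bound is used in MRDD; the symbol $\mathbf{s}^v$ for the view-specific representation of view $v$ and the variational network $q_\theta$ are illustrative notation, not taken from the paper.

$$
I_{\mathrm{CLUB}}(\mathbf{c}; \mathbf{s}^v) = \mathbb{E}_{p(\mathbf{c}, \mathbf{s}^v)}\left[\log q_\theta(\mathbf{s}^v \mid \mathbf{c})\right] - \mathbb{E}_{p(\mathbf{c})}\,\mathbb{E}_{p(\mathbf{s}^v)}\left[\log q_\theta(\mathbf{s}^v \mid \mathbf{c})\right]
$$

With the true conditional $p(\mathbf{s}^v \mid \mathbf{c})$ in place of $q_\theta$, this quantity upper-bounds the mutual information $I(\mathbf{c}; \mathbf{s}^v)$. With a learned $q_\theta$ it remains a bound only insofar as $q_\theta$ approximates the true conditional well. Minimizing it pushes the two representations toward sharing less information.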
Abstract
Multi-view representation learning aims to derive robust representations that are both view-consistent and view-specific from diverse data sources. This paper presents an in-depth analysis of existing approaches in this domain, highlighting a commonly overlooked aspect: the redundancy between view-consistent and view-specific representations. To this end, we propose an innovative framework for multi-view representation learning, which incorporates a technique we term 'distilled disentangling'. Our method introduces the concept of masked cross-view prediction, enabling the extraction of compact, high-quality view-consistent representations from various sources without incurring extra computational overhead. Additionally, we develop a distilled disentangling module that efficiently filters out consistency-related information from multi-view representations, resulting in purer view-specific representations. This approach significantly reduces redundancy between view-consistent and view-specific representations, enhancing the overall efficiency of the learning process. Our empirical evaluations reveal that higher mask ratios substantially improve the quality of view-consistent representations. Moreover, we find that reducing the dimensionality of view-consistent representations relative to that of view-specific representations further refines the quality of the combined representations. Our code is accessible at: https://github.com/Guanzhou-Ke/MRDD.
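The two empirical findings above (high mask ratios, and a consistent representation smaller than the view-specific ones) can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the function names `random_mask` and `joint_representation`, the patch layout, the mask ratio of 0.8, and the dimensions 2 and 8 are all hypothetical choices made for illustration, and the encoder that would actually produce the representations is omitted.

```python
import random


def random_mask(patches, mask_ratio, rng):
    """Zero out a random subset of patches.

    Returns the masked copy and a list of booleans marking masked positions.
    Masking by zeroing is an assumption; other schemes (e.g. mask tokens) exist.
    """
    n = len(patches)
    n_masked = int(round(mask_ratio * n))
    masked_idx = set(rng.sample(range(n), n_masked))
    masked = [
        [0.0] * len(p) if i in masked_idx else list(p)
        for i, p in enumerate(patches)
    ]
    flags = [i in masked_idx for i in range(n)]
    return masked, flags


def joint_representation(consistent, specifics):
    """Concatenate one consistent vector with every view-specific vector."""
    joint = list(consistent)
    for s in specifics:
        joint.extend(s)
    return joint


rng = random.Random(0)

# A toy "view" made of 10 patches with 4 features each (all non-zero).
view = [[float(i + 1)] * 4 for i in range(10)]
masked_view, flags = random_mask(view, mask_ratio=0.8, rng=rng)

# Placeholder representations: a small consistent vector (dim 2) and
# two larger view-specific vectors (dim 8 each), one per view.
c = [0.1] * 2
s = [[0.5] * 8, [0.7] * 8]
z = joint_representation(c, s)
```

In this toy setup, 8 of the 10 patches are hidden from the encoder, and the joint vector `z` has 2 + 8 + 8 = 18 entries, so the shared component occupies only a small fraction of the final representation.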
