
Multimodal Structure Learning: Disentangling Shared and Specific Topology via Cross-Modal Graphical Lasso

Fei Wang, Yutong Zhang, Xiong Wang

Abstract

Learning interpretable multimodal representations inherently relies on uncovering the conditional dependencies between heterogeneous features. However, extending sparse graph estimation techniques such as Graphical Lasso (GLasso) to visual-linguistic domains is severely bottlenecked by high-dimensional noise, modality misalignment, and the confounding of shared versus category-specific topologies. In this paper, we propose Cross-Modal Graphical Lasso (CM-GLasso), which overcomes these fundamental limitations. By coupling a novel text-visualization strategy with a unified vision-language encoder, we strictly align multimodal features into a shared latent space. We introduce a cross-attention distillation mechanism that condenses high-dimensional patches into explicit semantic nodes, naturally extracting spatial-aware cross-modal priors. Furthermore, we unify tailored GLasso estimation and Common-Specific Structure Learning (CSSL) into a joint objective optimized via the Alternating Direction Method of Multipliers (ADMM). This formulation guarantees the simultaneous disentanglement of invariant and class-specific precision matrices without multi-step error accumulation. Extensive experiments across eight benchmarks covering both natural and medical domains demonstrate that CM-GLasso establishes a new state-of-the-art in generative classification and dense semantic segmentation tasks.
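To fix ideas, here is a minimal sketch of what such a joint objective can look like, assuming the standard GLasso negative log-likelihood, the additive common-specific decomposition $\boldsymbol{\Theta}^{(c)} = \boldsymbol{\Theta}_{\text{com}} + \boldsymbol{S}^{(c)}$, and plain elementwise $\ell_1$ penalties (how the cross-modal priors reweight these penalties is an assumption here, not the paper's verbatim formulation):

$$\min_{\boldsymbol{\Theta}_{\text{com}},\,\{\boldsymbol{S}^{(c)}\}}\ \sum_{c=1}^{C}\Big[\operatorname{tr}\big(\hat{\boldsymbol{\Sigma}}^{(c)}\boldsymbol{\Theta}^{(c)}\big) - \log\det\boldsymbol{\Theta}^{(c)}\Big] + \lambda_{1}\big\|\boldsymbol{\Theta}_{\text{com}}\big\|_{1} + \lambda_{2}\sum_{c=1}^{C}\big\|\boldsymbol{S}^{(c)}\big\|_{1}, \qquad \boldsymbol{\Theta}^{(c)} = \boldsymbol{\Theta}_{\text{com}} + \boldsymbol{S}^{(c)} \succ 0,$$

where $\hat{\boldsymbol{\Sigma}}^{(c)}$ is the empirical covariance of class $c$ after Gaussianization. A typical ADMM splitting handles the $\ell_1$ terms via soft-thresholding and the log-det term via an eigendecomposition proximal step. The learned precisions then support generative classification under a class-conditional Gaussian model,

$$\hat{c}(\mathbf{x}) = \arg\max_{c}\Big[\log\pi_{c} + \tfrac{1}{2}\log\det\boldsymbol{\Theta}^{(c)} - \tfrac{1}{2}\big(\mathbf{x}-\boldsymbol{\mu}^{(c)}\big)^{\top}\boldsymbol{\Theta}^{(c)}\big(\mathbf{x}-\boldsymbol{\mu}^{(c)}\big)\Big],$$

with class priors $\pi_c$ and means $\boldsymbol{\mu}^{(c)}$.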



Figures (3)

  • Figure 1: The Pipeline of CM-GLasso. (a) Text visualization and unified SigLIP 2 feature extraction. (b) Cross-attention distillation condenses patches into semantic nodes. (c) Attention footprints derive spatial-aware cross-modal priors. (d-e) Nonparanormal transformation strictly ensures Gaussianity for the subsequent joint ADMM optimization, simultaneously disentangling shared ($\boldsymbol{\Theta}_{\text{com}}$) and specific ($\boldsymbol{S}^{(c)}$) topologies; a minimal sketch of this transform follows the figure list. (f) Learned structures govern generative classification and topology-aware segmentation.
  • Figure 2: GAM visualization of the classification head $\mathcal{H}_C$. For each dataset (CUB-200-2011, CIFAR-10, CIFAR-100, Caltech-256), four input images (left block) are paired with their GAM heatmaps (right block). Warm regions (red/yellow) indicate spatially discriminative areas; cool regions (blue) indicate low contribution. CM-GLasso consistently focuses on class-discriminative semantics (e.g., bird head/wings, vehicle contours) rather than background noise, validating that cross-modal prior guidance improves classification interpretability.
  • Figure 3: Qualitative segmentation results of $\mathcal{H}_S$. For each dataset (ADE20K, Kvasir-SEG, PASCAL VOC-2012, MS COCO-2014), three sample triplets are shown: input image, Ground Truth (GT), and CM-GLasso prediction (Ours). Our method produces precise boundaries and correctly captures long-range semantic dependencies (e.g., polyp edges, sky--water reflections, building--ground transitions), validating the benefit of joint ADMM optimization with cross-modal prior guidance.
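As referenced in the Figure 1 caption, the nonparanormal (Gaussian-copula) transform Gaussianizes each feature before precision estimation. Below is a minimal NumPy/SciPy sketch of the standard Winsorized variant; the truncation constant is a common choice from the nonparanormal literature and may differ from what CM-GLasso uses.

```python
import numpy as np
from scipy.stats import norm

def nonparanormal_transform(X: np.ndarray) -> np.ndarray:
    """Gaussianize each column of X via the empirical Gaussian-copula
    (nonparanormal) transform x -> Phi^{-1}(F_hat(x)), with Winsorized
    ranks so the inverse normal CDF stays finite at the extremes."""
    n, _ = X.shape
    # Truncation level: a standard choice in the nonparanormal
    # literature (an assumption here, not CM-GLasso's exact constant).
    delta = 1.0 / (4.0 * n ** 0.25 * np.sqrt(np.pi * np.log(n)))
    # Column-wise empirical CDF values in (0, 1) via double-argsort ranks.
    ranks = (np.argsort(np.argsort(X, axis=0), axis=0) + 1) / (n + 1.0)
    ranks = np.clip(ranks, delta, 1.0 - delta)
    Z = norm.ppf(ranks)
    # Rescale to unit variance per feature; the rank transform fixes
    # only the marginal shape, not the scale.
    return Z / Z.std(axis=0, ddof=1, keepdims=True)
```

The transformed matrix can then feed any standard sparse precision estimator; for example, per-class empirical covariances of `Z` are valid inputs to `sklearn.covariance.graphical_lasso`.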