STiL: Semi-supervised Tabular-Image Learning for Comprehensive Task-Relevant Information Exploration in Multimodal Classification

Siyi Du, Xinzhe Luo, Declan P. O'Regan, Chen Qin

TL;DR

STiL is a semi-supervised (SemiSL) tabular-image framework that addresses the Modality Information Gap by comprehensively exploring task-relevant information. Its new disentangled contrastive consistency module learns cross-modal invariant representations of shared information while retaining modality-specific information via disentanglement.

Abstract

Multimodal image-tabular learning is gaining attention, yet it faces challenges due to limited labeled data. While earlier work has applied self-supervised learning (SSL) to unlabeled data, its task-agnostic nature often results in learning suboptimal features for downstream tasks. Semi-supervised learning (SemiSL), which combines labeled and unlabeled data, offers a promising solution. However, existing multimodal SemiSL methods typically focus on unimodal or modality-shared features, ignoring valuable task-relevant modality-specific information, leading to a Modality Information Gap. In this paper, we propose STiL, a novel SemiSL tabular-image framework that addresses this gap by comprehensively exploring task-relevant information. STiL features a new disentangled contrastive consistency module to learn cross-modal invariant representations of shared information while retaining modality-specific information via disentanglement. We also propose a novel consensus-guided pseudo-labeling strategy to generate reliable pseudo-labels based on classifier consensus, along with a new prototype-guided label smoothing technique to refine pseudo-label quality with prototype embeddings, thereby enhancing task-relevant information learning in unlabeled data. Experiments on natural and medical image datasets show that STiL outperforms the state-of-the-art supervised/SSL/SemiSL image/multimodal approaches. Our code is available at https://github.com/siyi-wind/STiL.
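
The two pseudo-labeling ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). It assumes that "classifier consensus" means all classifiers predict the same class and that their averaged probability for it reaches a threshold `tau`, and that prototype-guided smoothing is a convex blend of the one-hot pseudo-label with a prototype-similarity distribution `q`, weighted by a smoothness parameter `r`. The function names `consensus_pseudo_label` and `smooth_label` are hypothetical.

```python
from typing import List, Optional, Sequence


def argmax(probs: Sequence[float]) -> int:
    """Index of the largest entry (first one on ties)."""
    return max(range(len(probs)), key=lambda k: probs[k])


def consensus_pseudo_label(
    classifier_probs: Sequence[Sequence[float]], tau: float
) -> Optional[int]:
    """Return a pseudo-label only when the classifiers agree and are confident.

    Assumption: consensus = identical argmax across all classifiers, and
    confidence = max of the averaged probabilities >= tau.
    """
    num_classes = len(classifier_probs[0])
    # Average the class probabilities over all classifiers.
    mean_probs = [
        sum(p[k] for p in classifier_probs) / len(classifier_probs)
        for k in range(num_classes)
    ]
    # Collect the class each classifier votes for.
    votes = {argmax(p) for p in classifier_probs}
    if len(votes) == 1 and max(mean_probs) >= tau:
        return argmax(mean_probs)
    return None  # no reliable pseudo-label for this sample


def smooth_label(
    pseudo_label: int, prototype_scores: Sequence[float], r: float
) -> List[float]:
    """Blend a one-hot pseudo-label with prototype similarity scores.

    Assumption: prototype_scores is already a probability distribution q,
    and the smoothed target is (1 - r) * one_hot + r * q.
    """
    return [
        (1.0 - r) * (1.0 if k == pseudo_label else 0.0) + r * q
        for k, q in enumerate(prototype_scores)
    ]
```

For example, three classifiers that all favor class 0 with averaged confidence above `tau` yield the pseudo-label 0, while a single dissenting classifier yields `None`, so that sample would be left without a pseudo-label under these assumptions.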

Paper Structure

This paper contains 14 sections, 12 equations, 9 figures, 11 tables.

Figures (9)

  • Figure 1: (a) Existing image-tabular pipelines using unlabeled data. (b) Illustration of the Modality Information Gap: task-relevant information exists in both shared and specific features. (c) STiL's framework, which addresses this gap and effectively learns task-relevant information from labeled and unlabeled data.
  • Figure 2: Overall framework of STiL. STiL encodes image-tabular data using $\phi$, decomposes modality-shared and -specific information through DCC $\psi$ (a), and outputs predictions via multimodal and unimodal classifiers $f$. STiL generates pseudo-labels for unlabeled data using CGPL (b) and refines them with prototype similarity scores in PGLS (c). (d) Training pathways for labeled and unlabeled data.
  • Figure 3: Plots of different methods on 1% labeled DVM: (a) accuracy of the confident pseudo-labels, where $\max \bar{\boldsymbol{p}}^m \geq \tau$; (b) ratio of the unlabeled samples with confident pseudo-labels; (c) accuracy of the smoothness term ($\boldsymbol{q}$ in Eq. \ref{eq:smooth}) on samples with confident pseudo-labels; and (d) accuracy of $\boldsymbol{q}$ on all unlabeled data samples.
  • Figure 4: t-SNE visualization of the multimodal embedding $\boldsymbol{v}$ for STiL trained on 1% labeled DVM or 10% labeled Infarction.
  • Figure 5: Results of STiL on 1% labeled DVM with varying (a) weight $\alpha$ for $\mathcal{L}_{ce}$, (b) weight $\lambda_p$ for $\mathcal{L}_{pt}$, (c) weight $\lambda_u$ for $\mathcal{L}_{uce}$, and (d) smoothness parameter $r$ in PGLS.
  • ...and 4 more figures
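
The two diagnostics in Figure 3(a) and (b) can be computed with a short sketch like the one below. It assumes each unlabeled sample has an averaged prediction vector (standing in for $\bar{\boldsymbol{p}}^m$) and a held-out ground-truth label used only for evaluation. The function name `confident_pseudo_label_stats` is hypothetical and the code is not taken from the paper's repository.

```python
from typing import List, Optional, Sequence, Tuple


def confident_pseudo_label_stats(
    mean_probs: Sequence[Sequence[float]],
    true_labels: Sequence[int],
    tau: float,
) -> Tuple[Optional[float], float]:
    """Return (accuracy of confident pseudo-labels, ratio of confident samples).

    A sample counts as confident when max(p) >= tau, matching the
    condition stated in the Figure 3 caption.
    """
    hits: List[bool] = []
    for probs, label in zip(mean_probs, true_labels):
        if max(probs) >= tau:
            # The pseudo-label is the most probable class.
            predicted = max(range(len(probs)), key=lambda k: probs[k])
            hits.append(predicted == label)
    ratio = len(hits) / len(mean_probs)
    # Accuracy is undefined when no sample passes the threshold.
    accuracy = sum(hits) / len(hits) if hits else None
    return accuracy, ratio
```

Raising `tau` typically trades a lower ratio of confident samples for higher pseudo-label accuracy, which is why the caption reports both quantities side by side.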