Cross-Modality Clustering-based Self-Labeling for Multimodal Data Classification

Paweł Zyblewski; Leandro L. Minku

TL;DR

The paper tackles the high labeling cost of multimodal data classification by introducing Cross-Modality Clustering-based Self-Labeling (CMCSL), which clusters each modality (e.g., $X_V$ and $X_T$) independently, uses a small labeling budget per class, and propagates the known labels within the resulting clusters. Disagreements between modalities are resolved by taking the label from the modality whose cluster centroid lies nearer to the instance in Euclidean distance, and a separate classifier is trained for each modality on the resulting pseudo-labels. In experiments on 20 MM-IMDb subsets, CMCSL shows that cross-modal label propagation can improve the generalization of modality-specific classifiers, especially when labeled data are scarce, with Gaussian Naive Bayes often achieving the strongest gains. The findings underline the value of exploiting complementary information across modalities during self-labeling and the importance of preprocessing to align the deep feature spaces; future work includes data streams, applying cross-modal propagation to other self-labeling approaches, and exploring additional modalities.
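The disagreement rule described above can be illustrated with a small sketch. It assumes that, in each modality, every instance already has a pseudo-label and the centroid of the cluster it was assigned to; the function `resolve_label`, the two-modality layout, and the tie-breaking toward the visual modality are illustrative assumptions, not details taken from the paper. Because the two distances are measured in different feature spaces, they are only comparable if those spaces are on a similar scale, which is plausibly one reason the authors stress preprocessing to align them.

```python
import math
from typing import Sequence, Tuple

Vector = Sequence[float]


def resolve_label(x_visual: Vector, centroid_visual: Vector, label_visual: int,
                  x_text: Vector, centroid_text: Vector, label_text: int
                  ) -> Tuple[int, str]:
    """Return the final pseudo-label and the modality it was taken from.

    Hypothetical helper. If both modalities agree, the shared label is kept.
    Otherwise the label comes from the modality whose cluster centroid is
    closer (Euclidean distance) to the instance in that modality's own space.
    """
    if label_visual == label_text:
        return label_visual, "both"
    d_visual = math.dist(x_visual, centroid_visual)
    d_text = math.dist(x_text, centroid_text)
    # Ties are broken in favor of the visual modality (arbitrary choice here).
    if d_visual <= d_text:
        return label_visual, "visual"
    return label_text, "text"


if __name__ == "__main__":
    # The modalities disagree (1 vs 0); the visual centroid is 0.1 away and
    # the text centroid 1.0 away, so the visual label wins.
    print(resolve_label([0.0, 0.0], [0.1, 0.0], 1, [1.0, 1.0], [2.0, 1.0], 0))
    # prints (1, 'visual')
```

Each per-modality classifier (for example Gaussian Naive Bayes) would then be trained on the resolved pseudo-labels.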

Abstract

Technological advances make it easier to acquire multimodal data, which poses a challenge for recognition systems but also offers an opportunity to exploit the heterogeneous nature of the information to improve the generalization capability of models. An often overlooked issue is the cost of the labeling process, which is typically high because it relies on human experts and therefore requires a significant investment of time and money. Existing semi-supervised learning methods often operate in the feature space created by fusing the available modalities, neglecting the potential to cross-utilize the complementary information available in each modality. To address this problem, we propose Cross-Modality Clustering-based Self-Labeling (CMCSL). Starting from a small set of pre-labeled data, CMCSL groups the instances of each modality in the deep feature space and then propagates the known labels within the resulting clusters. Next, information about the instances' class membership in each modality is exchanged based on the Euclidean distance to ensure more accurate labeling. An experimental evaluation on 20 datasets derived from the MM-IMDb dataset indicates that cross-propagating labels between modalities can make labeling more reliable and thus improve classification performance in each modality, especially when the number of pre-labeled instances is small.
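The first stage described in the abstract (cluster one modality, then spread the few known labels within each cluster) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes plain k-means with caller-supplied initial centroids and majority-vote propagation inside each cluster, since this summary does not specify the clustering algorithm or the number of clusters. The helper names `kmeans` and `propagate_within_clusters` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Optional, Sequence, Tuple

Point = Sequence[float]


def kmeans(points: Sequence[Point], init: Sequence[Point],
           n_iter: int = 20) -> Tuple[List[int], List[List[float]]]:
    """Plain Lloyd's k-means starting from the given initial centroids."""
    centroids = [list(c) for c in init]
    assignment: List[int] = []
    for _ in range(n_iter):
        # Assign every point to its nearest centroid (Euclidean distance).
        assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Move each centroid to the mean of its members; empty clusters stay put.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


def propagate_within_clusters(assignment: Sequence[int],
                              known: Dict[int, int]) -> List[Optional[int]]:
    """Give every instance the majority known label of its cluster.

    `known` maps instance index -> class label for the small pre-labeled set.
    Instances whose cluster holds no pre-labeled member get None.
    """
    votes: Dict[int, Counter] = {}
    for idx, label in known.items():
        votes.setdefault(assignment[idx], Counter())[label] += 1
    cluster_label = {c: cnt.most_common(1)[0][0] for c, cnt in votes.items()}
    return [cluster_label.get(c) for c in assignment]


if __name__ == "__main__":
    # Toy features for one modality: two well-separated groups of three points.
    feats = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
    assignment, _ = kmeans(feats, init=[feats[0], feats[3]])
    # Only instances 0 and 4 are pre-labeled (classes 3 and 7).
    print(propagate_within_clusters(assignment, {0: 3, 4: 7}))  # [3, 3, 3, 7, 7, 7]
```

In CMCSL this step is run independently on each modality's features (e.g. $X_V$ and $X_T$), after which class-membership information is exchanged between the modalities.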

Paper Structure (13 sections, 5 figures, 9 tables, 1 algorithm)

Introduction
Related works
Classifying multimodal data with missing labels
Preprocessing for data clustering
Gaps in the literature & motivation
Cross-Modality Clustering-based Self-Labeling
Experimental evaluation
Set-up
Experiment scenarios
Experiment 1 -- Classification algorithm selection
Experiment 2 -- Preprocessing impact
Experiment 3 -- Comparison of CMCSL with reference methods
Conclusion

Figures (5)

Figure 1: The general scheme of CMCSL, along with the procedure of encoder fine-tuning and feature extraction from the individual modalities.
Figure 2: Visualization of the label distribution for the four-class CDFS subset (Crime, Documentary, Fantasy and Sci-Fi) after converting the multilabel annotations into a multiclass problem. The final subset is composed only of the selected, non-overlapping classes.
Figure 3: Visualization of the impact of data preprocessing on the class distribution in the space of averaged samples, and on the balanced accuracy score of the classifier trained on CMCSL pseudo-labels, for the AT dataset.
Figure 4: Balanced accuracy (BAC) versus the number of labeled samples for four example datasets.
Figure 5: BAC versus the number of labeled samples, averaged over the binary and the multiclass datasets.
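Figures 2 to 5 rest on two data-handling steps that can be sketched in a few lines. The filter below is one plausible reading of the Figure 2 caption, not the authors' code: an instance is kept only if it carries exactly one of the selected genres (any other genres it has are ignored, which is an assumption), and that genre becomes its class. The balanced accuracy (BAC) function uses the standard definition, the mean of per-class recalls. The helper names `to_multiclass` and `balanced_accuracy` are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def to_multiclass(genres: Sequence[Set[str]],
                  selected: Sequence[str]) -> List[Tuple[int, int]]:
    """Keep instances tagged with exactly one of the selected genres.

    Returns (instance index, class index) pairs. Instances carrying zero or
    several selected genres are dropped, so the resulting classes do not overlap.
    """
    out = []
    for i, tags in enumerate(genres):
        hits = [c for c, g in enumerate(selected) if g in tags]
        if len(hits) == 1:
            out.append((i, hits[0]))
    return out


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean per-class recall, computed over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


if __name__ == "__main__":
    genres = [{"Crime"}, {"Crime", "Drama"}, {"Fantasy", "Sci-Fi"},
              {"Documentary"}, {"Comedy"}]
    # Instance 2 has two selected genres and instance 4 has none: both dropped.
    print(to_multiclass(genres, ["Crime", "Documentary", "Fantasy", "Sci-Fi"]))
    # prints [(0, 0), (1, 0), (3, 1)]
    # Recall is 2/3 for class 0 and 1 for class 1, so BAC = 5/6.
    print(round(balanced_accuracy([0, 0, 0, 1], [0, 0, 1, 1]), 4))  # 0.8333
```

Unlike plain accuracy, BAC weights every class equally, so a classifier cannot score well on an imbalanced subset simply by favoring the majority class.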
