Hierarchical Mutual Distillation for Multi-View Fusion: Learning from All Possible View Combinations
Jiwoong Yang, Haejun Chung, Ikbeom Jang
TL;DR
This work tackles the challenge of using arbitrary, unstructured multi-view data to improve classification by modeling all possible view combinations. It introduces MV-UWMD, a CNN-Transformer-based framework that fuses views through uncertainty-weighted aggregation and hierarchical mutual distillation, aligning single-view and partial-view predictions with full multi-view predictions. Experiments on Hotels-8k, GLDv2, and Carvana show state-of-the-art accuracy and robustness in both structured and unstructured settings, and ablations validate the contributions of partial-view training, uncertainty weighting, and the hierarchical distillation loss. The method also accepts a varying number of views at inference time, but training is more expensive because every view combination is evaluated. Future work includes subset-based view selection and extension to multi-modal tasks.
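To make the training-cost remark concrete, the sketch below enumerates every non-empty subset of a set of views. It is only an illustration of how "all possible view combinations" grows with the number of views (2^n - 1 subsets for n views); the function name `all_view_combinations` and the view labels are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def all_view_combinations(view_ids):
    """Return every non-empty subset of views, ordered by subset size.

    Hypothetical helper for illustration only; not the authors' code.
    """
    views = list(view_ids)
    return [
        combo
        for size in range(1, len(views) + 1)  # single-view, partial, full
        for combo in combinations(views, size)
    ]


combos = all_view_combinations(["v1", "v2", "v3"])
for combo in combos:
    print(combo)
```

With three views this yields three single-view subsets, three two-view subsets, and one full-view subset. With five views the count is already 31, which is why subset-based view selection is a natural direction for reducing training cost.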
Abstract
Multi-view learning often faces challenges in effectively leveraging images captured from different angles and locations. This challenge is particularly pronounced when addressing inconsistencies and uncertainties between views. In this paper, we propose a novel Multi-View Uncertainty-Weighted Mutual Distillation (MV-UWMD) method. Our method enhances prediction consistency by performing hierarchical mutual distillation across all possible view combinations, including single-view, partial multi-view, and full multi-view predictions. It incorporates an uncertainty-based weighting mechanism into the mutual distillation, allowing effective exploitation of the unique information in each view while mitigating the impact of uncertain predictions. We extend a CNN-Transformer hybrid architecture to facilitate robust feature learning and integration across multiple view combinations. We conduct extensive experiments on a large, unstructured dataset captured from diverse, non-fixed viewpoints. The results demonstrate that MV-UWMD improves prediction accuracy and consistency compared to existing multi-view learning approaches.
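The abstract does not give formulas, so the following is a minimal sketch of the general idea under stated assumptions rather than the authors' formulation. It assumes that uncertainty is measured by the entropy of each view's predicted class distribution, that fusion weights are a softmax over negative entropies, and that distillation is measured with a KL divergence from the fused prediction to a single-view prediction. All function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; used here as an ASSUMED uncertainty score."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def uncertainty_weighted_fusion(per_view_probs):
    """Fuse per-view class distributions, down-weighting uncertain views.

    ASSUMPTION: weights are a softmax over negative entropies.
    """
    weights = softmax([-entropy(p) for p in per_view_probs])
    n_classes = len(per_view_probs[0])
    fused = [
        sum(w * p[c] for w, p in zip(weights, per_view_probs))
        for c in range(n_classes)
    ]
    return fused, weights


def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred); an ASSUMED stand-in for a distillation loss term."""
    return sum(
        t * math.log((t + eps) / (q + eps))
        for t, q in zip(target, pred)
        if t > 0.0
    )


# Two views over three classes: one confident, one close to uniform.
confident_view = softmax([4.0, 0.0, 0.0])
uncertain_view = softmax([0.2, 0.1, 0.0])

fused, weights = uncertainty_weighted_fusion([confident_view, uncertain_view])
loss_uncertain = kl_divergence(fused, uncertain_view)
loss_confident = kl_divergence(fused, confident_view)
```

In this toy case the confident view receives the larger fusion weight, and the near-uniform view sits farther from the fused prediction, so it would receive the stronger corrective signal. A mutual scheme would also pass information in the other direction and across partial-view subsets, which this one-directional sketch does not attempt to show.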
