Hierarchical Mutual Distillation for Multi-View Fusion: Learning from All Possible View Combinations

Jiwoong Yang, Haejun Chung, Ikbeom Jang

TL;DR

This work tackles the challenge of leveraging arbitrary, unstructured multi-view data to improve classification by modeling all possible view combinations. It introduces HMDMV, a CNN-Transformer–based framework that performs all-views fusion through uncertainty-weighted aggregation and hierarchical mutual distillation, aligning single-view and partial-view predictions with full multi-view predictions. Empirical results on Hotels-8k, GLDv2, and Carvana demonstrate state-of-the-art accuracy and robustness across structured and unstructured settings, with ablations validating the contributions of partial-view training, uncertainty weighting, and the hierarchical distillation loss. The method also offers practical inference-time flexibility across varying numbers of views, though it incurs higher training cost due to exhaustive view combinations; future work includes subset-based view selection and extending to multi-modal tasks.

Abstract

Multi-view learning often struggles to leverage images captured from different angles and locations effectively, particularly when the views are inconsistent or uncertain. In this paper, we propose a novel Multi-View Uncertainty-Weighted Mutual Distillation (MV-UWMD) method. Our method enhances prediction consistency by performing hierarchical mutual distillation across all possible view combinations, including single-view, partial multi-view, and full multi-view predictions. It introduces an uncertainty-based weighting mechanism through mutual distillation, which exploits the unique information in each view while mitigating the impact of uncertain predictions. We extend a CNN-Transformer hybrid architecture to facilitate robust feature learning and integration across multiple view combinations. We conducted extensive experiments using a large, unstructured dataset captured from diverse, non-fixed viewpoints. The results demonstrate that MV-UWMD improves prediction accuracy and consistency compared to existing multi-view learning approaches.
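
To make "all possible view combinations" concrete, the following minimal sketch enumerates the single-view, partial multi-view, and full multi-view index subsets for a given number of views. The function names `view_combination_sets` and `total_combinations` are hypothetical and do not come from the paper; the sketch only illustrates the combinatorics, not the authors' implementation.

```python
from itertools import combinations
from math import comb


def view_combination_sets(n):
    """Map each subset size k (1..n) to the list of k-view index tuples.

    k = 1 gives the single-view subsets, 1 < k < n the partial
    multi-view subsets, and k = n the single full multi-view subset.
    """
    views = range(n)
    return {k: list(combinations(views, k)) for k in range(1, n + 1)}


def total_combinations(n):
    """Count all non-empty view subsets; this equals 2**n - 1."""
    return sum(comb(n, k) for k in range(1, n + 1))


# Example with three views (indices 0, 1, 2).
sets_for_3 = view_combination_sets(3)
for k, subsets in sets_for_3.items():
    print(k, subsets)
```

Because the number of non-empty subsets is 2^n - 1, the count roughly doubles with every additional view, which is one way to see why training over every combination becomes costly as the number of views grows.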

Paper Structure

This paper contains 18 sections, 12 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: (a) Hotels-8k [kamath20212021] and (b) Google Landmarks Dataset v2 [weyand2020google] are unstructured datasets with varying angles and environments. (c) Carvana [carvana-image-masking-challenge] is a structured dataset with fixed angles and environments.
  • Figure 2: (a) Unstructured multi-view of a hotel room, where images are taken from various arbitrary viewpoints. (b) Structured multi-view of a car, with images captured from fixed, predefined camera positions around the car.
  • Figure 3: Overview of the HMDMV method. Multi-view images serve as input and are processed by a hybrid CNN-Transformer network. (a) Feature tokens extracted from the $n$ input views are concatenated to form all possible combinations of $k$ views (where $k$ denotes the number of views in each combination), defining the combination set $C_k$. This set includes $\binom{n}{k}$ subsets. (b) For each $C_k$, the predictions $p_k^i$ in $p(C_k)$ are fused via uncertainty-weighted averaging to yield a prediction $P(C_k)$. (c) Hierarchical mutual distillation is performed sequentially by aligning the prediction of each combination set for $1 \leq k < n$ with the full multi-view prediction $P(C_n)$.
  • Figure 4: Inference with varying view counts. The number of test views $n_t$ may differ from the number of training views $n$. If $n_t < n$, the available views are duplicated to fill the missing slots and form $n$-view inputs. If $n_t = n$, inference matches training. If $n_t > n$, multiple $n$-view subsets are sampled and their predictions are ensembled. This ensures consistent inference regardless of the number of given views.
  • Figure 5: Performance comparison of multi-view methods. Panel (a) visualizes the results from Table \ref{tab:resulttable_1} for fixed multi-view data; panel (b) visualizes the results from Table \ref{tab:resulttable_2} for unstructured multi-view data.
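
The inference rule in the Figure 4 caption can be sketched as below. This is an illustration under stated assumptions, not the authors' code: the names `make_n_view_inputs` and `ensemble` are hypothetical, views are duplicated cyclically (the caption does not say which views are repeated), subsets are drawn uniformly at random, and ensembling is taken to be a plain average of class-probability vectors.

```python
import random


def make_n_view_inputs(test_views, n, num_subsets=4, seed=0):
    """Turn n_t test views into one or more n-view inputs.

    n_t < n : repeat the available views until n slots are filled
              (cyclic order is an assumption of this sketch).
    n_t == n: use the views as they are.
    n_t > n : sample several n-view subsets (uniform sampling is an
              assumption; the same subset may be drawn more than once).
    """
    n_t = len(test_views)
    if n_t == 0:
        raise ValueError("at least one test view is required")
    if n_t < n:
        return [[test_views[i % n_t] for i in range(n)]]
    if n_t == n:
        return [list(test_views)]
    rng = random.Random(seed)
    return [rng.sample(list(test_views), n) for _ in range(num_subsets)]


def ensemble(predictions):
    """Average class-probability vectors (assumed ensembling rule)."""
    count = len(predictions)
    num_classes = len(predictions[0])
    return [sum(p[c] for p in predictions) / count for c in range(num_classes)]


# Two test views for a model trained on three views: one view is repeated.
padded_inputs = make_n_view_inputs(["v0", "v1"], n=3)

# Five test views for the same model: several 3-view subsets are sampled.
sampled_inputs = make_n_view_inputs(["v0", "v1", "v2", "v3", "v4"], n=3)
```

In use, each returned $n$-view input would be passed through the trained model, and `ensemble` would combine the resulting probability vectors when more than one input is produced.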