Common Practices and Taxonomy in Deep Multi-view Fusion for Remote Sensing Applications

Francisco Mena, Diego Arenas, Marlon Nuske, Andreas Dengel

TL;DR

The paper surveys deep multi-view fusion for remote sensing, aiming to unify terminology and distill common practices across supervised EO tasks. It covers where, how, and what to fuse, detailing input-, feature-, and decision-level fusion, merge functions, and architectural choices for per-view encoders, regularization, and auxiliary losses. Empirically, feature-level fusion often delivers strong performance, optical views dominate while SAR/LiDAR/DSM provide valuable complements, and additional views generally boost predictive accuracy, though results vary by task and data. The work also outlines open challenges, including missing-view robustness, uncertainty quantification, and explainability, calling for standardized benchmarks and clearer comparisons of fusion strategies to advance the field.

Abstract

Advances in remote sensing technologies have boosted applications for Earth observation. These technologies provide multiple observations, or views, with different levels of information. The views may be static or temporal, may differ in resolution, and may carry different types and amounts of noise due to sensor calibration or deterioration. A great variety of deep learning models have been applied to fuse the information from these multiple views, an approach known as deep multi-view or multi-modal fusion learning. However, the approaches in the literature vary greatly, since different terminology is used to refer to similar concepts and different illustrations are given for similar techniques. This article gathers works on multi-view fusion for Earth observation, focusing on the common practices and approaches used in the literature. We summarize and structure insights from several publications, concentrating on unifying points and ideas. In this manuscript, we provide a harmonized terminology while also mentioning the alternative terms used in the literature. The works reviewed focus on supervised learning with neural network models. We hope this review, with a long list of recent references, can support future research and lead to a unified advance in the area.

Paper Structure (15 sections, 4 figures, 4 tables)

This paper contains 15 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Illustration of three different fusion strategies: input-level fusion at the top, feature-level fusion in the middle, and decision-level fusion at the bottom. The forward pass of the model runs from left to right (green arrows), while the backward pass runs from right to left (red dashed arrows). VE stands for view-encoder and PM for predictive model.
  • Figure 2: Illustration of additional fusion strategies found in the literature: central-feature strategy at the top, and hybrid strategy at the bottom. The forward pass of the model runs from left to right (green arrows), while the backward pass runs from right to left (red dashed arrows). The "predictive model F" represents the predictive model that is fed with the fused representation.
  • Figure 3: Number of papers reporting empirical evidence that each of the three fusion strategies (input, feature, and decision fusion) achieves the best, in-between, or worst predictive performance. Individual papers are listed in Table \ref{sup:tab:where_fuse}.
  • Figure 4: Illustration of additional components in feature-level fusion: auxiliary loss on each view at the top, and reconstruction on each view at the bottom. The forward pass of the model runs from left to right (green arrows), while the backward pass runs from right to left (red dashed arrows). The dashed green arrows are the auxiliary forward passes, used only for training. The "predictive model F" represents the predictive model that is fed with the fused representation.
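
The three strategies in the Figure 1 caption differ only in where the views are merged. The forward pass of each can be sketched roughly as below. This is a minimal illustration, not the paper's implementation. The assumptions are: view encoders (VE) are single dense layers with ReLU, predictive models (PM) are a dense layer followed by softmax, the merge function is concatenation for input- and feature-level fusion and averaging for decision-level fusion, and weights are random. All function names, dimensions, and the two toy views are hypothetical. The backward pass is omitted.

```python
import math
import random

def linear(x, w):
    """Dense layer without bias: x has length d_in, w has shape d_in x d_out."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]

def relu(v):
    return [max(0.0, a) for a in v]

def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]

def init(rng, d_in, d_out):
    """Random weights; a stand-in for trained parameters (assumption)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(d_out)] for _ in range(d_in)]

def input_fusion(views, w_enc, w_pm):
    """Merge the raw views first, then apply one shared VE and one PM."""
    x = [a for v in views for a in v]  # concatenation as the merge function
    return softmax(linear(relu(linear(x, w_enc)), w_pm))

def feature_fusion(views, w_encs, w_pm):
    """One VE per view, merge the encoded features, then apply one PM."""
    feats = [relu(linear(v, w)) for v, w in zip(views, w_encs)]
    z = [a for f in feats for a in f]  # concatenation as the merge function
    return softmax(linear(z, w_pm))

def decision_fusion(views, w_encs, w_pms):
    """One VE and one PM per view, then merge the per-view predictions."""
    preds = [softmax(linear(relu(linear(v, we)), wp))
             for v, we, wp in zip(views, w_encs, w_pms)]
    n = len(preds)
    return [sum(p[k] for p in preds) / n for k in range(len(preds[0]))]  # averaging

# Two toy views of one sample, e.g. 4 optical features and 2 SAR features.
rng = random.Random(0)
views = [[0.2, 0.5, 0.1, 0.9], [0.3, -0.4]]
HIDDEN, CLASSES = 3, 2

p_in = input_fusion(views, init(rng, 6, HIDDEN), init(rng, HIDDEN, CLASSES))
p_feat = feature_fusion(
    views,
    [init(rng, len(v), HIDDEN) for v in views],
    init(rng, HIDDEN * len(views), CLASSES),
)
p_dec = decision_fusion(
    views,
    [init(rng, len(v), HIDDEN) for v in views],
    [init(rng, HIDDEN, CLASSES) for _ in views],
)
```

Each call returns a probability vector over the classes. In this sketch the only structural difference between the strategies is the point at which the views are merged, which mirrors the top-to-bottom ordering described for Figure 1.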