AdaptiveFusion: Adaptive Multi-Modal Multi-View Fusion for 3D Human Body Reconstruction

Anjun Chen, Xiangyu Wang, Zhi Xu, Kun Shi, Yan Qin, Yuchi Huo, Jiming Chen, Qi Ye

TL;DR

AdaptiveFusion is a generic adaptive multi-modal multi-view fusion framework that can effectively incorporate arbitrary combinations of uncalibrated sensor inputs. It copes with arbitrary numbers of inputs and accommodates noisy modalities with only a single trained network.

Abstract

Recent advancements in sensor technology and deep learning have led to significant progress in 3D human body reconstruction. However, most existing approaches rely on data from a specific sensor, which can be unreliable due to the inherent limitations of individual sensing modalities. Additionally, existing multi-modal fusion methods generally require customized designs based on the specific sensor combinations or setups, which limits the flexibility and generality of these methods. Furthermore, conventional point-image projection-based and Transformer-based fusion networks are susceptible to the influence of noisy modalities and sensor poses. To address these limitations and achieve robust 3D human body reconstruction in various conditions, we propose AdaptiveFusion, a generic adaptive multi-modal multi-view fusion framework that can effectively incorporate arbitrary combinations of uncalibrated sensor inputs. By treating different modalities from various viewpoints as equal tokens, and by combining our handcrafted modality sampling module with the inherent flexibility of Transformer models, AdaptiveFusion is able to cope with arbitrary numbers of inputs and accommodate noisy modalities with only a single trained network. Extensive experiments on large-scale human datasets demonstrate the effectiveness of AdaptiveFusion in achieving high-quality 3D human body reconstruction in various environments. In addition, our method achieves superior accuracy compared to state-of-the-art fusion methods.
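
The idea of treating every modality and viewpoint as equal tokens, combined with modality sampling, can be illustrated with a small sketch. This is not the authors' implementation: the function names (`sample_modalities`, `build_token_sequence`, `attend`), the token counts, the feature dimension, and the uniform sampling rule are all assumptions made for illustration, and a single softmax attention step stands in for the full Transformer. The stream names follow the labels used in the figure captions (img1, dep1, img2, radar).

```python
import math
import random


def sample_modalities(available, rng, min_keep=1):
    """Keep a random non-empty subset of the available input streams (assumed rule)."""
    names = sorted(available)
    k = rng.randint(min_keep, len(names))  # inclusive on both ends
    return sorted(rng.sample(names, k))


def build_token_sequence(features, kept):
    """Concatenate tokens of the kept streams; no stream gets special treatment."""
    tokens, sources = [], []
    for name in kept:
        for tok in features[name]:
            tokens.append(tok)
            sources.append(name)
    return tokens, sources


def attend(query, tokens):
    """Single-head scaled dot-product attention of one query over all tokens."""
    d = len(query)
    scores = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(d) for tok in tokens]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    fused = [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(d)]
    return fused, weights


rng = random.Random(0)
dim = 4  # hypothetical feature dimension
token_counts = {"img1": 3, "dep1": 2, "img2": 3, "radar": 2}  # hypothetical
features = {
    name: [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
    for name, n in token_counts.items()
}

kept = sample_modalities(features.keys(), rng)
tokens, sources = build_token_sequence(features, kept)
query = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # stands in for a joint query
fused, weights = attend(query, tokens)
print(kept, len(tokens))
```

Because attention operates on a sequence of any length, the same `attend` function handles every subset that `sample_modalities` returns, which is the property that lets one network serve arbitrary input combinations.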

Paper Structure (26 sections, 6 equations, 4 figures, 9 tables)

Figures (4)

  • Figure 1: Comparison of different fusion methods. (a) Framework of our proposed AdaptiveFusion. We first extract global and local features from each of the sampled modalities using the corresponding backbones. Next, we use the Global Integrated Module to incorporate the global features. Then, we employ the Fusion Transformer Module to fuse the global and local features and to regress the locations of joints and vertices. D.R. MLP stands for a dimension reduction MLP. (b) DeepFusion [li2022deepfusion]. (c) TokenFusion [wang2022multimodal]. (d) FUTR3D [chen2023futr3d].
  • Figure 2: Comparison of AdaptiveFusion using different input combinations with other methods on the mmBody dataset. Img1 and dep1 denote the RGB images and depth point clouds from the first viewpoint. Img2 and dep2 are from the second viewpoint (no adverse conditions for this viewpoint).
  • Figure 3: Qualitative results. Each row shows an adverse scene (rain, smoke, poor lighting, and occlusion), and the columns show the reconstructed mesh and the attention weights. From top to bottom, the weights are for the estimation of the left ankle, right elbow, right ankle, and right shoulder. In the Vertices column, darker colors indicate larger attention weights. In the Image1 through Radar columns, reddish colors indicate larger attention weights and bluish colors indicate smaller ones.
  • Figure 4: Failure cases in the furnished, smoke, and occlusion scenes. Radar points are in green and depth points are in orange. Depth points in the occlusion scene are from the other viewpoint, which is not occluded.
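
The per-modality attention maps described for Figure 3 can also be summarized numerically. The sketch below is one possible way to do so, not a procedure taken from the paper: the helper `attention_share` and the example weights are hypothetical. It sums the attention that a single joint query places on the tokens of each input stream.

```python
from collections import defaultdict


def attention_share(weights, sources):
    """Fraction of total attention mass that falls on each input stream."""
    total = sum(weights)
    share = defaultdict(float)
    for w, s in zip(weights, sources):
        share[s] += w / total
    return dict(share)


# Hypothetical attention weights of one joint query over five tokens.
weights = [0.05, 0.05, 0.10, 0.30, 0.50]
sources = ["image1", "image1", "depth1", "radar", "radar"]

share = attention_share(weights, sources)
for name, value in sorted(share.items()):
    print(f"{name}: {value:.2f}")
```

With these made-up numbers, most of the attention mass falls on the radar tokens, which is the kind of shift one would look for when the image or depth stream is degraded.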