ADMN: A Layer-Wise Adaptive Multimodal Network for Dynamic Input Noise and Compute Resources

Jason Wu, Yuyang Yuan, Kang Yang, Lance Kaplan, Mani Srivastava

TL;DR

ADMN addresses two problems in multimodal networks: input quality of information (QoI) that changes over time, and strictly bounded compute budgets. It introduces layer-wise adaptive backbones and a QoI-aware controller that reallocates layers across modalities for each sample. Training proceeds in two stages: Stage 1 pretrains the backbones with LayerDrop, and Stage 2 trains the controller, using corruption-aware supervision or autoencoder-based initialization together with differentiable top-$L$ layer selection via Gumbel-Softmax. Empirical results on GDTM, MM-Fi, and AVE show that ADMN matches state-of-the-art accuracy while reducing FLOPs by up to 75% and latency by up to 60%, and that it generalizes to three modalities. The approach enables efficient, QoI-responsive multimodal inference under changing hardware and sensor conditions, and code is available for reproducibility.
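To make the top-$L$ selection step concrete, here is a minimal pure-Python sketch of its forward pass. It is not the authors' implementation: the function names (`gumbel_top_l`, `split_by_modality`), the temperature `tau`, and the toy logits are all hypothetical. Per-layer controller scores are perturbed with Gumbel noise, and the `budget` highest-scoring layers across all modalities are kept. A trainable version would usually also pass gradients through the soft probabilities (for example with a straight-through estimator), which is omitted here.

```python
import math
import random


def gumbel_top_l(logits, budget, tau=1.0, rng=None):
    """Select exactly `budget` layers from Gumbel-perturbed logits.

    Returns (mask, soft): a hard 0/1 mask containing `budget` ones, and the
    temperature-scaled softmax over the perturbed logits.
    Hypothetical helper for illustration only.
    """
    rng = rng or random.Random()
    # Gumbel(0, 1) sample: -log(-log(U)) with U ~ Uniform(0, 1).
    noisy = [
        x - math.log(-math.log(rng.uniform(1e-9, 1.0 - 1e-9)))
        for x in logits
    ]
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    scaled = [x / tau for x in noisy]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    soft = [e / total for e in exps]
    # Hard selection: keep the `budget` layers with the largest perturbed logits.
    top = sorted(range(len(logits)), key=lambda i: noisy[i], reverse=True)[:budget]
    mask = [0] * len(logits)
    for i in top:
        mask[i] = 1
    return mask, soft


def split_by_modality(mask, layers_per_modality):
    """Slice a flat layer mask into one sub-mask per modality."""
    out, start = {}, 0
    for name, count in layers_per_modality.items():
        out[name] = mask[start:start + count]
        start += count
    return out


if __name__ == "__main__":
    # Assumed toy scores: the controller favors the second modality's layers,
    # for instance because the first modality's input is corrupted.
    logits = [0.1, 0.2, 0.3, 2.0, 2.1, 2.2]
    mask, _ = gumbel_top_l(logits, budget=3, rng=random.Random(0))
    print(split_by_modality(mask, {"rgb": 3, "depth": 3}))
```

Because the mask always contains exactly `budget` ones, the number of executed layers is fixed by construction regardless of how they are split across modalities. This is the property a strict compute budget requires.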

Abstract

Multimodal deep learning systems are deployed in dynamic scenarios due to the robustness afforded by multiple sensing modalities. Nevertheless, they struggle with varying compute resource availability (due to multi-tenancy, device heterogeneity, etc.) and fluctuating quality of inputs (from sensor feed corruption, environmental noise, etc.). Statically provisioned multimodal systems cannot adapt when compute resources change over time, while existing dynamic networks struggle with strict compute budgets. Additionally, both kinds of systems often neglect the impact of variations in modality quality. Consequently, modalities suffering substantial corruption may needlessly consume resources better allocated to other modalities. We propose ADMN, a layer-wise Adaptive Depth Multimodal Network capable of tackling both challenges: it adjusts the total number of active layers across all modalities to meet strict compute resource constraints and continually reallocates layers across input modalities according to their modality quality. Our evaluations show that ADMN can match the accuracy of state-of-the-art networks while reducing their floating-point operations by up to 75%.

Paper Structure

This paper contains 31 sections, 1 equation, 19 figures, 13 tables.

Figures (19)

  • Figure 1: Overview of ADMN. Variable-depth backbones adapt to both changing compute resources and input noise characteristics.
  • Figure 2: ADMN architecture. [Gray box]: dropped layer, [Blue box]: frozen layer, [Red box]: tunable layer. TE: Transformer Encoder.
  • Figure 3: Detailed depiction of the ADMN controller.
  • Figure 4: Latency (ms) and GFLOPs vs. number of layers for GDTM (left) and MM-Fi (right).
  • Figure 5: t-SNE of the autoencoder for different levels of RGB Blur on GDTM Blur
  • ...and 14 more figures