Neuro-Inspired Information-Theoretic Hierarchical Perception for Multimodal Learning

Xiongye Xiao, Gengshuo Liu, Gaurav Gupta, Defu Cao, Shixuan Li, Yaxing Li, Tianqing Fang, Mingxi Cheng, Paul Bogdan

TL;DR

The paper presents Information-Theoretic Hierarchical Perception (ITHP), a neuro-inspired framework for multimodal learning that designates a prime modality and distills information via a two-level Information Bottleneck (IB) hierarchy. By optimizing $\mathcal{L}_{IB_0} = I(X_0; B_0) - \beta I(B_0; X_1)$ and $\mathcal{L}_{IB_1} = I(B_0; B_1) - \gamma I(B_1; X_2)$ with variational approximations and combining them into a joint objective, ITHP constructs compact, informative latent representations for downstream tasks. Experiments on MUStARD, CMU-MOSI, and CMU-MOSEI show state-of-the-art performance, including surpassing human-level benchmarks on CMU-MOSI in multimodal sentiment classification when using ITHP-DeBERTa. The approach demonstrates robust cross-modal information distillation with efficient inference, though it relies on a predefined modality order and may require extensions to handle missing modalities or learn ordering adaptively.
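
As a rough illustration of how each level's objective might be turned into a trainable loss, the sketch below uses standard variational bounds. The compression term (for example $I(X_0; B_0)$) is bounded above by a KL divergence to a prior, and the relevance term (for example $I(B_0; X_1)$) is replaced by a reconstruction error on the detector modality. The function names, the standard-normal prior, the fixed-variance Gaussian decoder (which reduces the log-likelihood to a squared error up to constants and scaling), and the weight `lam` that combines the two levels are all assumptions made for illustration; the summary does not specify them.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats (closed form)."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

def mse(prediction, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)

def ib_level_loss(mu, log_var, predicted_detector, detector, weight):
    """Surrogate for I(input; latent) - weight * I(latent; detector).

    Assumption: the compression term is upper-bounded by a KL to a
    standard-normal prior, and the relevance term is lower-bounded via a
    fixed-variance Gaussian decoder, so maximizing it amounts to
    minimizing a squared reconstruction error.
    """
    compression = kl_to_standard_normal(mu, log_var)
    reconstruction = mse(predicted_detector, detector)
    return compression + weight * reconstruction

def joint_loss(loss_level0, loss_level1, lam):
    """Hypothetical combination of the two levels; `lam` is an assumed weight."""
    return loss_level0 + lam * loss_level1
```

In this sketch, `weight` plays the role of $\beta$ at the first level and $\gamma$ at the second: larger values favor retaining information about the detector modality over compressing the latent state.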

Abstract

Integrating and processing information from multiple sources or modalities is critical for obtaining a comprehensive and accurate perception of the real world in autonomous systems and cyber-physical systems. Drawing inspiration from neuroscience, we develop the Information-Theoretic Hierarchical Perception (ITHP) model, which utilizes the concept of information bottleneck. Unlike most traditional fusion models, which incorporate all modalities identically in neural networks, our model designates a prime modality and regards the remaining modalities as detectors in the information pathway, serving to distill the flow of information. Our proposed perception model focuses on constructing an effective and compact information flow by achieving a balance between the minimization of mutual information between the latent state and the input modal state, and the maximization of mutual information between the latent states and the remaining modal states. This approach leads to compact latent state representations that retain relevant information while minimizing redundancy, thereby substantially enhancing the performance of multimodal representation learning. Experimental evaluations on the MUStARD, CMU-MOSI, and CMU-MOSEI datasets demonstrate that our model consistently distills crucial information in multimodal learning scenarios, outperforming state-of-the-art benchmarks. Remarkably, on the CMU-MOSI dataset, ITHP surpasses human-level performance in the multimodal sentiment binary classification task across all evaluation metrics (i.e., Binary Accuracy, F1 Score, Mean Absolute Error, and Pearson Correlation).

Paper Structure

This paper contains 36 sections, 30 equations, 6 figures, 11 tables, 4 algorithms.

Figures (6)

  • Figure 1: Constructing two latent states, $B_0$ and $B_1$, facilitates the transfer of pertinent information among three modal states $X_0$, $X_1$, and $X_2$.
  • Figure 2: An illustration of the proposed model architecture. Each encoder has two MLP layers: the first extracts feature vectors from the input states, and the second generates the parameters of the latent Gaussian distribution. The Venn diagrams illustrate the information constraints imposed by the optimization problem.
  • Figure 3: A schematic representation of our proposed ITHP and its information flow. The diagram illustrates the process of feature extraction from multimodal embedding data including video frames, text, and audio patterns. These modalities pass through a "Feature Extraction" phase, where they are embedded to get modal states $X_0$, $X_1$, and $X_2$. The derived states are then processed to construct latent states $B_0$ and $B_1$. This processing includes reciprocal information exchange between $X_1$ and $B_0$, as well as between $B_1$ and $X_2$. The resulting information from this process is then used to make a determination about the presence of sarcasm.
  • Figure 4: Weighted precision and recall for the binary classification task under varying Lagrange multipliers ($\beta$ and $\gamma$). In each plot, red denotes the highest value and orange the lowest.
  • Figure 5: $N$ latent states are constructed to extract and transfer the most relevant information from a set of $N+1$ modal states. The modal states are represented by $X_0$, $X_1$, ..., $X_{N-1}$, and $X_N$, ordered by how much information each modality contains. The latent states $B_0$, $B_1$, ..., $B_{N-1}$ form the pathway that transfers the relevant information among the modal states. Through the hierarchical structure, the most relevant information from $X_0$ to $X_N$ is gradually retained and distilled.
  • ...and 1 more figure
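
The encoder described in the Figure 2 caption, chained as $X_0 \rightarrow B_0 \rightarrow B_1$, can be sketched roughly as follows. This is a minimal pure-Python illustration, not the authors' implementation: the class name `GaussianEncoder`, the ReLU activation, the layer sizes, the random initialization, and the use of a reparameterized sample are all assumptions.

```python
import math
import random

def linear(x, weights, bias):
    """Affine map: one output per (row, bias) pair."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(v):
    return [max(0.0, a) for a in v]

def init_linear(rng, n_in, n_out):
    """Small random weights and zero biases (an arbitrary choice for the sketch)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

class GaussianEncoder:
    """Two-layer encoder: features first, then Gaussian parameters."""

    def __init__(self, rng, n_in, n_hidden, n_latent):
        self.feature = init_linear(rng, n_in, n_hidden)           # layer 1: feature extraction
        self.params = init_linear(rng, n_hidden, 2 * n_latent)    # layer 2: mean and log-variance
        self.n_latent = n_latent

    def __call__(self, x, rng):
        hidden = relu(linear(x, *self.feature))
        out = linear(hidden, *self.params)
        mu, log_var = out[:self.n_latent], out[self.n_latent:]
        # Reparameterized sample: mu + sigma * eps, with eps ~ N(0, 1)
        sample = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
                  for m, lv in zip(mu, log_var)]
        return mu, log_var, sample

rng = random.Random(0)
encoder0 = GaussianEncoder(rng, n_in=8, n_hidden=16, n_latent=4)  # X0 -> B0
encoder1 = GaussianEncoder(rng, n_in=4, n_hidden=16, n_latent=2)  # B0 -> B1

x0 = [rng.gauss(0.0, 1.0) for _ in range(8)]   # stand-in for prime-modality features
mu0, log_var0, b0 = encoder0(x0, rng)
mu1, log_var1, b1 = encoder1(b0, rng)
```

Extending the chain to $N$ latent states, as in Figure 5, would amount to stacking further encoders of the same shape, each taking the previous latent state as input.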