CaReFlow: Cyclic Adaptive Rectified Flow for Multimodal Fusion

Sijie Mai, Shiqin Han

TL;DR

This work extends rectified flow to modality distribution mapping. To achieve more accurate mapping and resolve the ambiguous flow directions that arise in one-to-many mapping, it designs "adaptive relaxed alignment", which enforces stricter alignment for modality pairs belonging to the same sample while applying relaxed mapping for pairs not belonging to the same sample or category.

Abstract

The modality gap significantly restricts the effectiveness of multimodal fusion. Previous methods often use techniques such as diffusion models and adversarial learning to reduce the modality gap, but they typically focus on one-to-one alignment without exposing the data points of the source modality to the global distribution information of the target modality. To this end, we extend rectified flow, which can map one distribution to another via a straight trajectory, to modality distribution mapping. Specifically, we leverage the "one-to-many mapping" strategy in rectified flow, which allows each data point of the source modality to observe the overall target distribution. This also alleviates the issue of insufficient paired data within each sample, enabling a more robust distribution transformation. Moreover, to achieve more accurate distribution mapping and address the ambiguous flow directions in one-to-many mapping, we design "adaptive relaxed alignment", which enforces stricter alignment for modality pairs belonging to the same sample while applying relaxed mapping for pairs not belonging to the same sample or category. Additionally, to prevent information loss during distribution mapping, we introduce "cyclic rectified flow" to ensure that the transferred features can be translated back to the original features, allowing multimodal representations to learn sufficient modality-specific information. After distribution alignment, our approach achieves very competitive results on multiple multimodal affective computing tasks even with a simple fusion method, and visualizations verify that it can effectively reduce the modality gap.
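To make the straight-trajectory and one-to-many ideas concrete, the following is a minimal NumPy sketch of generic rectified flow training targets, not the authors' implementation. It assumes unimodal features are plain arrays; the function names, the `num_targets` parameter, and the uniform random choice of target points are all hypothetical. Adaptive relaxed alignment and cyclic rectified flow are not modeled here, because the abstract does not specify their form.

```python
import numpy as np


def rectified_flow_pairs(source, target, rng, num_targets=4):
    """Build (x_t, t, velocity) training triples with one-to-many pairing.

    Each source point is paired with `num_targets` randomly drawn target
    points (an illustrative assumption), so it is exposed to more of the
    target distribution than a single same-sample partner would provide.
    """
    n = source.shape[0]
    # Random target indices for every source point: shape (n, num_targets).
    idx = rng.integers(0, target.shape[0], size=(n, num_targets))
    x0 = np.repeat(source[:, None, :], num_targets, axis=1)  # (n, k, d)
    x1 = target[idx]                                          # (n, k, d)
    t = rng.uniform(0.0, 1.0, size=(n, num_targets, 1))
    x_t = (1.0 - t) * x0 + t * x1   # point on the straight line from x0 to x1
    velocity = x1 - x0              # constant velocity the model should regress
    return x_t, t, velocity


def euler_transport(x0, velocity_fn, steps=10):
    """Move source features toward the target by Euler-integrating a velocity field."""
    x = x0.copy()
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x
```

In such a setup, a velocity network would be trained to predict `velocity` from `(x_t, t)` with a squared-error loss, and `euler_transport` would then be called with that network as `velocity_fn` to carry source-modality features toward the target distribution.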

Paper Structure (20 sections, 12 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: (a) shows that the distributions of different modalities differ significantly in the feature space, and (b) shows that a vanilla multimodal model (using feature concatenation and MLP networks for fusion) even underperforms a language-only model. After applying our CaReFlow to reduce the modality gap, the multimodal model significantly outperforms the language-only model.
  • Figure 2: Comparison of traditional methods, rectified flow [liuflow], and our CaReFlow. CaReFlow implements one-to-many mapping, adaptive relaxed alignment, and cyclic information flow to improve modality alignment.
  • Figure 3: (a) The diagram of the model pipeline and (b) the forward and backward information flows in CaReFlow.
  • Figure 4: Results on (a) UR-FUNNY and (b) MUStARD datasets.
  • Figure 5: Visualizations of the distributions of unimodal features using t-SNE [tsne]. CL denotes contrastive learning.
  • ...and 1 more figures