Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities

Yiyuan Zhang, Xiaohan Ding, Kaixiong Gong, Yixiao Ge, Ying Shan, Xiangyu Yue

TL;DR

This work improves a transformer trained on one modality by leveraging irrelevant data from other modalities, a setting in which samples across modalities are neither paired nor aligned. It introduces Multimodal Pathway (M2PT) and Cross-Modal Re-parameterization, a mechanism that couples the target and auxiliary transformers through learnable pathway scales and adds no inference cost. Across image, video, point cloud, and audio tasks, M2PT yields consistent gains, including higher ImageNet accuracy, better downstream metrics on COCO and ADE20K, and better 3D and audio recognition, even when the target model is trained from scratch. The study provides evidence of modality-complementary knowledge in transformers and highlights potential for data-scarce domains, while noting the need for theoretical grounding and extensions to other architectures.

Abstract

We propose to improve transformers of a specific modality with irrelevant data from other modalities, e.g., improve an ImageNet model with audio or point cloud datasets. We highlight that the data samples of the target modality are irrelevant to the other modalities, which distinguishes our method from other works that use paired (e.g., CLIP) or interleaved data of different modalities. We propose a methodology named Multimodal Pathway: given a target modality and a transformer designed for it, we use an auxiliary transformer trained with data of another modality and construct pathways to connect components of the two models so that data of the target modality can be processed by both models. In this way, we utilize the universal sequence-to-sequence modeling abilities of transformers obtained from two modalities. As a concrete implementation, we use a modality-specific tokenizer and task-specific head as usual but utilize the transformer blocks of the auxiliary model via a proposed method named Cross-Modal Re-parameterization, which exploits the auxiliary weights without any inference costs. On image, point cloud, video, and audio recognition tasks, we observe significant and consistent performance improvements with irrelevant data from other modalities. The code and models are available at https://github.com/AILab-CVC/M2PT.

Paper Structure (13 sections, 9 equations, 3 figures, 7 tables)

Figures (3)

  • Figure 1: Compared to the known paradigms which use well-aligned multimodal data, we focus on scenarios where the data samples are from multiple modalities but irrelevant, which is an open problem in the literature.
  • Figure 2: (Left) Framework of Multimodal Pathway Transformer (M2PT), using the point cloud and image modalities as an example. Common practices with transformers follow the same pipeline: 1) tokenizers convert the input data to sequences, 2) transformer blocks process the sequences, and 3) heads decode the sequences. We upgrade the sequence-to-sequence modeling by establishing pathways between the components of different modalities, so that processing the tokens of a specific modality can utilize the transformer blocks trained with another modality. (Middle) Conceptual design of M2PT, where the pathways are implemented by letting a linear layer in the target model (including the Query/Key/Value/projection layers in the attention block and the linear layers in the FFN block) cooperate with its counterpart in the auxiliary model. (Right) Cross-Modal Re-parameterization efficiently realizes M2PT by re-parameterizing the weights of the target model with those of the auxiliary model, which introduces marginal training costs and no inference costs at all.
  • Figure 3: Consistent improvements brought by M2PT across each pair of four modalities: image, video, point cloud, and audio. The metrics are ImageNet-1K accuracy, Kinetics-400 accuracy, PartNet mIoU, and AudioSet accuracy, respectively. The numbers are the relative improvements, in percent, of M2PT models over baseline models pretrained with MAE-style methods [he2022masked, pang2022masked, huang2022masked, zhou2022audio] on the corresponding modality.
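
The relationship between the conceptual design (Figure 2, Middle) and Cross-Modal Re-parameterization (Figure 2, Right) can be illustrated with a minimal NumPy sketch. It assumes, since this page does not state the formula, that a pathway adds the auxiliary layer's output scaled by a scalar lam, so that the two-branch layer x @ W_target + lam * (x @ W_aux) equals a single layer with merged weight W_target + lam * W_aux. All names (W_target, W_aux, lam, the function names) and the layer sizes are hypothetical, the weights are random rather than trained, and bias terms are omitted for simplicity. This is not the authors' implementation, which is available at the repository linked in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_tokens = 8, 4, 5  # illustrative sizes, not taken from the paper

# Hypothetical weights: W_target is a linear layer of the target-modality model,
# W_aux is the counterpart layer of the auxiliary-modality model.
W_target = rng.standard_normal((d_in, d_out))
W_aux = rng.standard_normal((d_in, d_out))
lam = 0.3  # pathway scale; learnable during training, a fixed number in this sketch


def two_branch_forward(x):
    """Conceptual design: apply both layers and add the scaled auxiliary output."""
    return x @ W_target + lam * (x @ W_aux)


def merge_weights(w_target, w_aux, scale):
    """Re-parameterization: fold the auxiliary weight into one weight matrix."""
    return w_target + scale * w_aux


def merged_forward(x, w_merged):
    """Inference path: one matrix multiply, the same cost as the plain target layer."""
    return x @ w_merged


x = rng.standard_normal((n_tokens, d_in))  # a sequence of target-modality tokens
y_two = two_branch_forward(x)
W_merged = merge_weights(W_target, W_aux, lam)
y_merged = merged_forward(x, W_merged)

# The two paths agree up to floating-point rounding.
print(np.allclose(y_two, y_merged))
```

Because the merged layer has the same shape as the original target layer, the auxiliary weights can be discarded once the merge is done, which is consistent with the caption's statement that the method adds no inference cost.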