ViT-Lens: Towards Omni-modal Representations

Weixian Lei, Yixiao Ge, Kun Yi, Jianfeng Zhang, Difei Gao, Dylan Sun, Yuying Ge, Ying Shan, Mike Zheng Shou

TL;DR

ViT-Lens provides a unified solution for representation learning across a growing number of modalities, with two appealing advantages: (i) it unlocks the potential of pretrained ViTs for novel modalities in a parameter- and data-efficient way, and (ii) it enables emergent downstream capabilities through modality alignment and shared ViT parameters.

Abstract

Aiming to advance AI agents, large foundation models significantly improve reasoning and instruction execution, yet the current focus on vision and language neglects the potential of perceiving diverse modalities in open-world environments. However, the success of data-driven vision and language models is costly or even infeasible to reproduce for rare modalities. In this paper, we present ViT-Lens-2, which facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning them to a pre-defined space. Specifically, the modality-specific lens is tuned to project any-modal signals into an intermediate embedding space, which is then processed by a strong ViT with pretrained visual knowledge. The encoded representations are optimized toward aligning with the modal-independent space, pre-defined by off-the-shelf foundation models. ViT-Lens-2 provides a unified solution for representation learning of increasing modalities with two appealing advantages: (i) unlocking the great potential of pretrained ViTs for novel modalities effectively with an efficient data regime; (ii) enabling emergent downstream capabilities through modality alignment and shared ViT parameters. We tailor ViT-Lens-2 to learn representations for 3D point cloud, depth, audio, tactile and EEG, and set new state-of-the-art results across various understanding tasks, such as zero-shot classification. By seamlessly integrating ViT-Lens-2 into Multimodal Foundation Models, we enable Any-modality to Text and Image Generation in a zero-shot manner. Code and models are available at https://github.com/TencentARC/ViT-Lens.
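
The alignment step described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes a symmetric InfoNCE-style contrastive loss (the excerpt does not state the exact objective), and it uses a single linear projection as a stand-in for the lens and a fixed random matrix as a stand-in for the frozen pretrained ViT. All names (`lens_weights`, `frozen_vit`, `contrastive_alignment_loss`, the temperature value) are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale each vector to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_alignment_loss(modality_feats, anchor_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss (an assumption, not confirmed by the excerpt).

    Row i of modality_feats is paired with row i of anchor_feats; every other
    row in the batch acts as a negative.
    """
    z = l2_normalize(modality_feats)
    a = l2_normalize(anchor_feats)
    logits = z @ a.T / temperature
    idx = np.arange(len(z))
    loss_m2a = -log_softmax(logits, axis=1)[idx, idx].mean()  # modality -> anchor
    loss_a2m = -log_softmax(logits, axis=0)[idx, idx].mean()  # anchor -> modality
    return 0.5 * (loss_m2a + loss_a2m)

rng = np.random.default_rng(0)
batch, in_dim, dim = 8, 32, 16

# Hypothetical stand-ins: only the lens would be trained, the ViT stays frozen.
lens_weights = rng.normal(size=(in_dim, dim)) * 0.1
frozen_vit = rng.normal(size=(dim, dim)) * 0.1

def encode(signals):
    """Lens projects raw signals to an intermediate space; the frozen ViT encodes them."""
    return np.tanh(signals @ lens_weights) @ frozen_vit

signals = rng.normal(size=(batch, in_dim))    # toy novel-modality inputs
anchor_feats = rng.normal(size=(batch, dim))  # toy features from a foundation model

loss_random = contrastive_alignment_loss(encode(signals), anchor_feats)
loss_aligned = contrastive_alignment_loss(anchor_feats, anchor_feats)
print(f"untrained lens: {loss_random:.3f}, perfectly aligned: {loss_aligned:.3f}")
```

In an actual training loop, gradients of this loss would update only the lens (and modality embedding) parameters, while the ViT and the anchor encoder remain fixed. The perfectly aligned case is included only to show that the loss decreases as the two feature sets agree.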

Paper Structure (32 sections, 1 equation, 10 figures, 25 tables)

This paper contains 32 sections, 1 equation, 10 figures, 25 tables.

Figures (10)

  • Figure 1: ViT-Lens for omni-modal representation learning. (A) ViT-Lens consistently improves performance on understanding tasks, such as classification, zero-shot classification (ZS), and linear probing (LP), across 3D point cloud (liu2023openshape), depth (girdhar2023imagebind), audio (girdhar2023imagebind), tactile (yang2022touch_and_go), and EEG (bai2023dreamdiffusion) modalities. The citations denote the previous methods used for comparison. Further details are given in the experiments section. (B) Plugging ViT-Lens into multimodal foundation models enables emergent applications "out of the box", including any-modality captioning/QA, any-modality-to-image generation, and text-guided any-modality-to-image editing, to name a few.
  • Figure 2: Training pipeline. ViT-Lens extends the capabilities of a pretrained ViT to diverse modalities. For each novel modality, it first employs a Modality Embedding (ModEmbed) and a Lens that learn to map modality-specific data into an intermediate embedding space. It then uses a set of pretrained ViT layers to encode the feature. Finally, the output feature is aligned with the feature that an off-the-shelf foundation model extracts from the new modality's anchor data (image, text, etc.).
  • Figure 3: Lens architecture used in ViT-Lens.
  • Figure 4: Demonstration of integrating ViT-Lens into a multimodal foundation model (MFM). (A) The original MFM pipeline for vision; (B) plugging well-trained Lenses for different modalities into the MFM, without additional instruction-following training.
  • Figure 5: Few-shot linear probing on depth and 3D point cloud. We mark the zero-shot classification performance on the y-axis. We train linear classifiers on fixed features for the $\ge\!1$-shot settings.
  • ...and 5 more figures
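
The figure captions refer to zero-shot classification of novel-modality inputs. As a rough illustration of how this is commonly done in an aligned embedding space, the sketch below assigns each sample to the class whose text embedding has the highest cosine similarity. It assumes the embeddings have already been produced by some encoder. The function name `zero_shot_classify` and the toy vectors are hypothetical and are not taken from the paper or its code.

```python
import numpy as np

def zero_shot_classify(sample_embeds, class_text_embeds):
    """Return, for each sample, the index of the most similar class embedding.

    sample_embeds:     (num_samples, dim) embeddings of novel-modality inputs
    class_text_embeds: (num_classes, dim) embeddings of class-name prompts
    """
    s = sample_embeds / np.linalg.norm(sample_embeds, axis=1, keepdims=True)
    c = class_text_embeds / np.linalg.norm(class_text_embeds, axis=1, keepdims=True)
    similarity = s @ c.T  # cosine similarity, shape (num_samples, num_classes)
    return similarity.argmax(axis=1)

# Toy 3-class example with hand-made embeddings (illustrative only).
class_text_embeds = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
sample_embeds = np.array([
    [0.9, 0.1, 0.0],  # closest to class 0
    [0.1, 0.2, 2.0],  # closest to class 2
])
predictions = zero_shot_classify(sample_embeds, class_text_embeds)
print(predictions.tolist())
```

No classifier is trained here: class decisions come entirely from similarity to text embeddings, which is what makes the setting "zero-shot". Linear probing, by contrast, would fit a linear classifier on top of the same fixed features.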