ViT-Lens: Initiating Omni-Modal Exploration through 3D Insights

Weixian Lei, Yixiao Ge, Jianfeng Zhang, Dylan Sun, Kun Yi, Ying Shan, Mike Zheng Shou

TL;DR

ViT-Lens introduces a modality-agnostic framework that repurposes a frozen pretrained ViT as a universal multi-modal sensor. It uses a modality-specific embedding plus a Perceiver to map diverse inputs into the ViT input space and trains a multimodal contrastive objective to align these representations with a CLIP-derived anchor space. The approach achieves strong zero-shot 3D classification, surpassing prior SOTA on ModelNet40 and Objaverse-LVIS, and demonstrates emergent capabilities by enabling 3D perception within a multimodal LLM like InstructBLIP without task-specific fine-tuning. This work signals a scalable, data-efficient direction for omni-modal learning by leveraging existing foundation-model knowledge to rapidly assimilate new modalities and tasks.
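
As a rough illustration of the embedding-plus-Perceiver idea summarized above, the sketch below embeds a variable-length point cloud and lets a fixed set of latent queries cross-attend to it, producing a fixed-length token sequence of the kind a frozen ViT could consume. It is a minimal sketch, not the authors' implementation: the names `lens`, `cross_attend`, and `embed_W`, the sizes, and the random weights are all hypothetical, and the learned query/key/value projections, multiple heads, and residual layers of a real Perceiver are omitted.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    # W is a list of rows (out_dim x in_dim); returns W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def cross_attend(latents, tokens):
    # Single-head cross-attention with no learned projections (an
    # assumption made for brevity): each latent queries every token.
    d = len(latents[0])
    out = []
    for q in latents:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * tok[i] for w, tok in zip(weights, tokens))
                    for i in range(d)])
    return out


def lens(points, embed_W, latents):
    # Hypothetical "modality embedding": a linear map from xyz to DIM.
    tokens = [matvec(embed_W, p) for p in points]
    # Fixed-length output, independent of the number of input points.
    return cross_attend(latents, tokens)


rng = random.Random(0)
DIM, NUM_LATENTS = 8, 4  # toy sizes, chosen only for the example
embed_W = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(DIM)]
latents = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_LATENTS)]

for n_points in (16, 100):
    cloud = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(n_points)]
    out = lens(cloud, embed_W, latents)
    print(n_points, "->", len(out), "x", len(out[0]))  # 4 x 8 in both cases
```

Whatever the number of input points, the output has one row per latent, which is what would let the same frozen ViT be reused across inputs of different sizes.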

Abstract

Despite the success of CLIP-based training recipes in vision-language models, their scalability to more modalities (e.g., 3D, audio) is limited by the need for large-scale data, which is expensive or even unavailable for rare modalities. In this paper, we present ViT-Lens, which facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning them to a pre-defined space. Specifically, a modality-specific lens is tuned to project multimodal signals into a shared embedding space, where they are processed by a strong ViT that carries pretrained image knowledge. The encoded multimodal representations are optimized to align with a modality-independent space pre-defined by off-the-shelf foundation models. A well-trained lens with a ViT backbone has the potential to serve as one of these foundation models, supervising the learning of subsequent modalities. ViT-Lens provides a unified solution for representation learning across a growing number of modalities, with two appealing benefits: (i) it exploits the pretrained ViT effectively across tasks and domains in a data-efficient regime; (ii) emergent downstream capabilities for novel modalities arise from the shared alignment space. We evaluate ViT-Lens on 3D as an initial verification. In zero-shot 3D classification, ViT-Lens achieves substantial improvements over the previous state of the art, reaching 52.0% accuracy on Objaverse-LVIS, 87.4% on ModelNet40, and 60.6% on ScanObjectNN. Furthermore, we enable zero-shot 3D question answering by simply integrating the trained 3D lens into the InstructBLIP model without any adaptation. We will release results of ViT-Lens on more modalities in the near future.
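
The alignment step described in the abstract can be illustrated with a CLIP-style symmetric contrastive (InfoNCE) loss between 3D embeddings and their anchor embeddings. This is a sketch under assumptions rather than the paper's exact objective: the function names, the temperature of 0.07, and the toy embeddings are hypothetical, and a full training setup would presumably combine such a term for each anchor type (image and text).

```python
import math


def normalize(v):
    # Scale a vector to unit L2 norm.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def info_nce(queries, keys, temperature=0.07):
    # Mean cross-entropy of matching queries[i] to keys[i] against all keys.
    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    loss = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]
    return loss / len(q)


def alignment_loss(shape_emb, anchor_emb, temperature=0.07):
    # Symmetric form: shape-to-anchor and anchor-to-shape, averaged.
    return 0.5 * (info_nce(shape_emb, anchor_emb, temperature)
                  + info_nce(anchor_emb, shape_emb, temperature))


# Toy batch: three shapes whose anchors are either matched or shuffled.
shapes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
matched = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shuffled = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]

print("matched :", alignment_loss(shapes, matched))
print("shuffled:", alignment_loss(shapes, shuffled))
```

The matched batch yields a loss near zero and the shuffled batch a much larger one, which is the signal that would drive the lens toward the anchor space while the anchor encoder stays frozen.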

Paper Structure (12 sections, 1 equation, 4 figures, 9 tables)

Figures (4)

  • Figure 1: (a) Illustration of ViT-Lens. ViT-Lens extends the capabilities of a pretrained ViT to perceive and comprehend diverse modalities beyond 2D images. It first employs a modality embedding and the Perceiver architecture [jaegle2021perceiver] to map modality-specific data into the pretrained ViT's input space. The encoded output of the ViT is then aligned with the feature extracted from the data's anchor (text, image, or text-image pair) by an off-the-shelf foundation model. This approach enables a pretrained ViT to integrate and understand diverse modalities beyond images while leveraging its pretraining knowledge to better comprehend and interpret them. (b) Zero-shot 3D classification. ViT-Lens outperforms state-of-the-art methods on zero-shot 3D classification when pretrained on the datasets introduced by ULIP [xue2023ulip], ULIP-2 [xue2023ulip2], and OpenShape [liu2023openshape], respectively. (c) Emergent downstream abilities. By incorporating new modalities into the ViT of an off-the-shelf MLLM, ViT-Lens empowers the LLM to understand novel modalities or their combinations without any tailored instruction-following tuning.
  • Figure 2: Training pipeline of ViT-Lens for 3D shape understanding. ViT-Lens aligns the triplet of 3D point clouds, 2D rendered images, and textual descriptions to a unified feature space defined by CLIP. It leverages a powerful pretrained vision-language model, CLIP [openai_clip, cherti2022openclip], which is frozen during pretraining and provides a pre-aligned feature space. The 3D shape encoder consists of a point embedding layer, a Perceiver, and a pretrained CLIP-ViT shared with the image encoder. To enhance 3D point cloud encoding, point embeddings are obtained and distilled through the Perceiver before being fed into the frozen CLIP-ViT to obtain the final representation. The training objective is to minimize the contrastive loss that aligns features in the shared feature space.
  • Figure 3: Zero-shot classification performance on ModelNet40. We assess the impact of model scaling using OpenAI-B16 and OpenAI-L14, and analyze the influence of pretraining datasets, which the figure legend distinguishes with two $\blacktriangleright$ markers.
  • Figure 4: Scaling model size and pretraining data: PointBERT vs. ViT-Lens. We compare PointBERT and the ViT-Lens encoder using the same pretraining dataset and the same CLIP model for alignment. ViT-Lens exhibits superior zero-shot performance and scalability.
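
The zero-shot classification setting referenced in Figures 1(b), 3, and 4 can be sketched as a nearest-text-embedding lookup in the shared space: a shape embedding is compared against one text embedding per class name, and the most similar class wins. The snippet below is a toy illustration only; the function names and the three-dimensional embeddings are hypothetical stand-ins for the outputs of a trained 3D encoder and a CLIP text encoder, and it does not reproduce any reported number.

```python
import math


def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(shape_emb, class_text_embs):
    # class_text_embs maps a class label to its (hypothetical) text embedding.
    scores = {label: cosine(shape_emb, emb)
              for label, emb in class_text_embs.items()}
    return max(scores, key=scores.get), scores


def top1_accuracy(shape_embs, labels, class_text_embs):
    # Fraction of shapes whose best-scoring class equals the true label.
    hits = sum(zero_shot_classify(e, class_text_embs)[0] == y
               for e, y in zip(shape_embs, labels))
    return hits / len(labels)


# Toy stand-ins for text embeddings of class prompts.
class_text_embs = {
    "chair": [1.0, 0.0, 0.0],
    "table": [0.0, 1.0, 0.0],
    "lamp": [0.0, 0.0, 1.0],
}

print(zero_shot_classify([0.9, 0.2, 0.1], class_text_embs)[0])  # chair
```

No classifier head is trained for the target dataset in this setup; only the class names change, which is why the same aligned encoder can be evaluated on ModelNet40, Objaverse-LVIS, and ScanObjectNN.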