ModalChorus: Visual Probing and Alignment of Multi-modal Embeddings via Modal Fusion Map

Yilin Ye, Shishi Xiao, Xingchen Zeng, Wei Zeng

TL;DR

ModalChorus is an interactive system for visual probing and alignment of multi-modal embeddings that can facilitate intuitive discovery of misalignment and efficient re-alignment in scenarios ranging from zero-shot classification to cross-modal retrieval and generation.

Abstract

Multi-modal embeddings, such as CLIP embeddings (the most widely used text-image embeddings), form the foundation for vision-language models. However, these embeddings are vulnerable to subtle misalignment of cross-modal features, resulting in decreased model performance and diminished generalization. To address this problem, we design ModalChorus, an interactive system for visual probing and alignment of multi-modal embeddings. ModalChorus primarily offers a two-stage process: 1) embedding probing with Modal Fusion Map (MFM), a novel parametric dimensionality reduction method that integrates both metric and nonmetric objectives to enhance modality fusion; and 2) embedding alignment that allows users to interactively articulate intentions for both point-set and set-set alignments. Quantitative and qualitative comparisons for CLIP embeddings with existing dimensionality reduction (e.g., t-SNE and MDS) and data fusion (e.g., data context map) methods demonstrate the advantages of MFM in showcasing cross-modal features over common vision-language datasets. Case studies reveal that ModalChorus can facilitate intuitive discovery of misalignment and efficient re-alignment in scenarios ranging from zero-shot classification to cross-modal retrieval and generation.

Paper Structure

This paper contains 19 sections, 6 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Overview of our framework. Multi-modal embeddings and concepts extracted from text and images are first projected with Modal Fusion Map, a novel modality-fusing DR method we propose. In the visual exploration stage, visual probing of the embeddings is enabled in the projection view and the concept axis view, allowing users to explore embedding sets and individual instance points. Finally, in the embedding alignment stage, interactive alignment with point-set and set-set alignment schemes is provided, along with optional augmentation with few-shot samples.
  • Figure 2: (a) Data Context Map (DCM) only considers metric-based optimization that indiscriminately seeks to preserve the absolute distance for intra-modal and inter-modal pairs of data points. (b) Inspired by the observation that the nonmetric rank order of cross-modal distances is important for multi-modal embedding-based tasks, our Modal Fusion Map (MFM) combines metric and nonmetric objectives for fusion.
  • Figure 3: For the zero-shot classification task on CIFAR-10, which relies on cross-modal similarity (the color of each point represents the predicted class), MFM can better reflect set relations and outliers for visual probing of misalignment.
  • Figure 4: DCM insufficiently captures the rank order of cross-modal distances between text and image embeddings, resulting in a rather even distribution of image embedding points around the concept text embeddings, which makes differences in distribution patterns harder to observe.
  • Figure 5: ModalChorus system. (a) The settings panel on the left lets users choose the task and dataset. The main projection view (b) displays the MFM dimensionality reduction result for the embeddings. The concept axis view (d) supports axis-based exploration, while the augmentation panel (e) facilitates uploading, generating, and tagging additional data for alignment.
  • ...and 5 more figures
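
The distinction drawn in the Figure 2 caption, between preserving absolute distances (metric) and preserving the rank order of cross-modal distances (nonmetric), can be illustrated with a small sketch. This is not the MFM objective from the paper, whose formulation does not appear in this excerpt. It is a toy comparison under stated assumptions: `metric_stress` is a plain MDS-style raw stress, `rank_violations` is a hypothetical helper that counts flipped image-to-text distance orderings, and all coordinates are made up.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def metric_stress(high, low):
    """MDS-style raw stress: squared error between original and projected
    pairwise distances, summed over all pairs regardless of modality."""
    total = 0.0
    for i in range(len(high)):
        for j in range(i + 1, len(high)):
            total += (euclidean(high[i], high[j]) - euclidean(low[i], low[j])) ** 2
    return total


def rank_violations(img_high, txt_high, img_low, txt_low):
    """Hypothetical nonmetric check: for each image, count pairs of text
    concepts whose distance ordering is reversed by the projection."""
    violations = 0
    for ih, il in zip(img_high, img_low):
        dh = [euclidean(ih, t) for t in txt_high]
        dl = [euclidean(il, t) for t in txt_low]
        for a in range(len(dh)):
            for b in range(a + 1, len(dh)):
                if (dh[a] - dh[b]) * (dl[a] - dl[b]) < 0:
                    violations += 1
    return violations


# Toy data (assumed): one image embedding and two text concept embeddings.
img_high = [(0.0, 0.0, 1.0)]
txt_high = [(0.0, 0.0, 2.0), (0.0, 3.0, 1.0)]  # concept 0 is the nearer one

# Projection A keeps concept 0 nearer to the image; projection B swaps them.
img_low = [(0.0, 0.0)]
txt_low_a = [(1.0, 0.0), (3.0, 0.0)]
txt_low_b = [(3.0, 0.0), (1.0, 0.0)]

violations_a = rank_violations(img_high, txt_high, img_low, txt_low_a)
violations_b = rank_violations(img_high, txt_high, img_low, txt_low_b)
stress_a = metric_stress(img_high + txt_high, img_low + txt_low_a)
stress_b = metric_stress(img_high + txt_high, img_low + txt_low_b)

print("rank violations:", violations_a, violations_b)
print("metric stress:", round(stress_a, 3), round(stress_b, 3))
```

In this toy case projection B reverses which concept is nearest to the image, which is exactly the kind of change that would alter a similarity-based prediction. A metric term penalizes it only through the size of the distance errors, whereas a rank-based term flags the reversal directly.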