From Latent to Engine Manifolds: Analyzing ImageBind's Multimodal Embedding Space

Andrew Hamara; Pablo Rivas

TL;DR

Initial findings with ImageBind's emergent zero-shot cross-modal retrieval suggest that pure audio embeddings can correlate with semantically similar marketplace listings, indicating potential avenues for future research.

Abstract

This study investigates ImageBind's ability to generate meaningful fused multimodal embeddings for online auto parts listings. We propose a simplistic embedding fusion workflow that aims to capture the overlapping information of image/text pairs, ultimately combining the semantics of a post into a joint embedding. After storing such fused embeddings in a vector database, we experiment with dimensionality reduction and provide empirical evidence to convey the semantic quality of the joint embeddings by clustering and examining the posts nearest to each cluster centroid. Additionally, our initial findings with ImageBind's emergent zero-shot cross-modal retrieval suggest that pure audio embeddings can correlate with semantically similar marketplace listings, indicating potential avenues for future research.
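
The fusion and audio-retrieval ideas described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the fusion rule (mean of unit-normalized image and text vectors), the helper names `fuse` and `retrieve`, and the three-dimensional toy vectors are hypothetical and are not taken from the paper or from ImageBind's API. Real ImageBind embeddings would come from the model's encoders and have far more dimensions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def fuse(image_emb, text_emb):
    """Assumed fusion rule: average the unit-normalized image and text
    embeddings, then re-normalize. The paper's exact rule may differ."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    return l2_normalize([(a + b) / 2.0 for a, b in zip(img, txt)])


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def retrieve(query_emb, fused_listings, k=1):
    """Return the ids of the k listings most similar to the query."""
    ranked = sorted(
        fused_listings,
        key=lambda listing_id: cosine(query_emb, fused_listings[listing_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy stand-ins for encoder outputs (hypothetical values).
listings = {
    "engine": fuse([1.0, 0.0, 0.0], [0.9, 0.1, 0.0]),
    "wheel": fuse([0.0, 1.0, 0.0], [0.1, 0.9, 0.0]),
}

# A stand-in for an audio embedding living in the same shared space.
audio_query = [0.8, 0.2, 0.0]
top = retrieve(audio_query, listings, k=1)
```

Because all modalities are assumed to share one embedding space, the audio query can be compared directly against the fused listing vectors with cosine similarity; a vector database would perform the same ranking at scale.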

Paper Structure (14 sections, 2 equations, 5 figures, 1 table)

This paper contains 14 sections, 2 equations, 5 figures, 1 table.

Introduction
ImageBind Overview
Modalities
Encoders
Training
Methodology
Embedding Fusion
Clustering
Results and Discussion
Cluster Analysis
Cross-Modal Retrieval
Conclusions
Acknowledgments
Disclosure of Interests

Figures (5)

Figure 1: Example of a vehicle posted for sale online. It has a textual description and an image out of many.
Figure 2: A "frustratingly" simple workflow for creating joint, multimodal embeddings.
Figure 3: UMAP visualization of the high-dimensional embeddings, with clusters colorized based on 32-dimensional $k$-means results.
Figure 4: Images of posts near their respective $k$-means centroid, grouped by cluster.
Figure 5: Retrieval of fused listing embeddings via semantically similar audio embeddings.
