Enhancing Multimodal Understanding with CLIP-Based Image-to-Text Transformation

Chang Che, Qunwei Lin, Xinyu Zhao, Jiaxin Huang, Liqiang Yu

TL;DR

The paper addresses the challenge of translating visual inputs into textual explanations by proposing a CLIP-based ensemble for image-to-text transformation. It introduces Model A, which uses an enhanced visual-to-text embedding, and Model B, which fuses zero-shot CLIP embeddings via K-nearest neighbors (KNN), then combines the two into an ensemble. Data preprocessing through augmentation and cosine-similarity filtering improves data quality while reducing dataset size and training time. The reported results show that the ensemble achieves state-of-the-art alignment with ground-truth text and surpasses standalone CLIP models, with an Avg-CosSim of $0.5961$ in the evaluation, which suggests practical value for captioning and image retrieval tasks.
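The two cosine-similarity ideas mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that Avg-CosSim is the mean row-wise cosine similarity between predicted and ground-truth text embeddings, and that filtering means dropping samples whose embedding is too similar to one already kept. The function names (`avg_cos_sim`, `filter_near_duplicates`) and the `threshold` value are hypothetical.

```python
import numpy as np

_EPS = 1e-12  # guards against division by zero for all-zero vectors


def cosine_similarity_rows(a, b):
    """Row-wise cosine similarity between two arrays of shape (n, d)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.maximum(np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1), _EPS)
    return np.sum(a * b, axis=1) / denom


def avg_cos_sim(pred, target):
    """Mean cosine similarity over all (prediction, ground-truth) pairs.

    Assumption: this matches the Avg-CosSim score quoted in the summary.
    """
    return float(np.mean(cosine_similarity_rows(pred, target)))


def filter_near_duplicates(embeddings, threshold=0.9):
    """Greedy filter: keep a sample only if its cosine similarity to every
    previously kept sample is below `threshold` (a hypothetical value).

    Returns the indices of the kept samples, in their original order.
    """
    emb = np.asarray(embeddings, dtype=float)
    unit = emb / np.maximum(np.linalg.norm(emb, axis=1, keepdims=True), _EPS)
    kept = []
    for i, vec in enumerate(unit):
        if all(float(vec @ unit[j]) < threshold for j in kept):
            kept.append(i)
    return kept


if __name__ == "__main__":
    pred = [[1.0, 0.0], [0.0, 1.0]]
    target = [[1.0, 0.0], [1.0, 0.0]]
    print(avg_cos_sim(pred, target))

    samples = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]]
    print(filter_near_duplicates(samples, threshold=0.9))
```

In practice the embeddings would come from a CLIP text or image encoder rather than hand-written vectors. The greedy filter compares each sample against every kept sample, so it scales quadratically and an approximate nearest-neighbor index would be a more realistic choice for large datasets.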

Abstract

The process of transforming input images into corresponding textual explanations stands as a crucial and complex endeavor within the domains of computer vision and natural language processing. In this paper, we propose an innovative ensemble approach that harnesses the capabilities of Contrastive Language-Image Pretraining models.

Paper Structure

This paper contains 12 sections, 8 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Architecture of the CLIP Model
  • Figure 2: Architecture of Model A - Enhanced Image-to-Text Transformation
  • Figure 3: Zero-Shot Learning Architecture in Model B
  • Figure 4: K-Nearest Neighbors Fusion in Model B