Enhancing Multimodal Understanding with CLIP-Based Image-to-Text Transformation

Chang Che; Qunwei Lin; Xinyu Zhao; Jiaxin Huang; Liqiang Yu

TL;DR

The paper proposes a CLIP-based ensemble for translating images into textual explanations. Model A uses an enhanced visual-to-text embedding, Model B fuses zero-shot CLIP embeddings via K-nearest neighbors (KNN), and the ensemble combines the two. Preprocessing through data augmentation and cosine-similarity filtering improves data quality while reducing dataset size and training time. In the evaluation, the ensemble reaches an average cosine similarity (Avg-CosSim) of $0.5961$ against ground-truth text, surpassing standalone CLIP models; the paper reports this as state-of-the-art alignment and as evidence of practical value for captioning and image retrieval.
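To make the reported metric concrete, here is a minimal sketch of an average cosine similarity between predicted and ground-truth text embeddings. The function names `cosine_similarity` and `avg_cos_sim` are illustrative, and the sketch assumes the score is a plain mean over aligned embedding pairs; the paper's exact definition may differ.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def avg_cos_sim(
    predicted: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
) -> float:
    """Mean cosine similarity over aligned (predicted, target) pairs.

    Assumption: one predicted embedding per ground-truth embedding,
    compared pair by pair and averaged with equal weight.
    """
    scores = [cosine_similarity(p, t) for p, t in zip(predicted, target)]
    return sum(scores) / len(scores)
```

Under this reading, a score of 1 would mean every predicted embedding points in the same direction as its ground-truth counterpart, and a score of 0 would mean the pairs are orthogonal on average.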

Abstract

Transforming input images into corresponding textual explanations is a crucial and complex task in computer vision and natural language processing. In this paper, we propose an ensemble approach that harnesses the capabilities of Contrastive Language-Image Pretraining (CLIP) models.

Paper Structure (12 sections, 8 equations, 4 figures, 1 table)

Introduction
Related Work
Algorithm and Model
CLIP Model
Model A - Enhanced Image-to-Text Transformation
Model B - Zero-Shot Learning and KNN-based Fusion
Model Ensemble
Data Preprocessing
Data Augmentation
Cosine Similarity Filtering
Performance Measurement
Experiment Results

Figures (4)

Figure 1: Architecture of the CLIP Model
Figure 2: Architecture of Model A - Enhanced Image-to-Text Transformation
Figure 3: Zero-Shot Learning Architecture in Model B
Figure 4: K-Nearest Neighbors Fusion in Model B
