Multimodal Multilabel Classification by CLIP

Yanming Guo

TL;DR

We leverage a novel technique that uses Contrastive Language-Image Pre-training (CLIP) as the feature extractor and fine-tunes the model by exploring different classification heads, fusion methods, and loss functions.

Abstract

Multimodal multilabel classification (MMC) is a challenging task that aims to design a learning algorithm that handles two data sources, image and text, and learns a comprehensive semantic feature representation across the modalities. In this work, we review a large number of state-of-the-art approaches in MMC and leverage a novel technique that uses Contrastive Language-Image Pre-training (CLIP) as the feature extractor and fine-tunes the model by exploring different classification heads, fusion methods, and loss functions. Our best result achieved an F_1 score of more than 90% on the public Kaggle competition leaderboard. This paper provides detailed descriptions of the novel training methods and a quantitative analysis of the experimental results.
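To make the pipeline in the abstract concrete, the following is a minimal sketch, not the paper's implementation. It assumes that image and text embeddings have already been extracted (random placeholder vectors stand in for CLIP features here), that fusion is plain concatenation, that the classification head is a single linear layer, and that the loss is binary cross-entropy. The function names `fuse_concat`, `linear_head`, `bce_loss`, and `predict` are hypothetical and chosen only for illustration; the paper explores several alternatives for each of these components.

```python
import math
import random


def fuse_concat(image_feat, text_feat):
    """Late fusion by concatenation (assumed here; one of several options)."""
    return list(image_feat) + list(text_feat)


def linear_head(features, weights, bias):
    """One logit per label: logits[k] = w_k . features + b_k."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce_loss(logits, targets):
    """Mean binary cross-entropy over labels (assumed baseline loss)."""
    eps = 1e-12
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        total += -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    return total / len(targets)


def predict(logits, threshold=0.5):
    """Each label is decided independently, so several can be active."""
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]


# Placeholder vectors standing in for CLIP image/text embeddings (assumption).
rng = random.Random(0)
image_feat = [rng.gauss(0, 1) for _ in range(4)]
text_feat = [rng.gauss(0, 1) for _ in range(4)]

num_labels = 3
fused = fuse_concat(image_feat, text_feat)
weights = [[rng.gauss(0, 0.1) for _ in fused] for _ in range(num_labels)]
bias = [0.0] * num_labels

logits = linear_head(fused, weights, bias)
targets = [1, 0, 1]  # hypothetical multilabel ground truth
loss = bce_loss(logits, targets)
labels = predict(logits)
```

Because each label gets its own sigmoid rather than sharing a softmax, the head can switch on any subset of labels, which is what distinguishes the multilabel setting from ordinary multiclass classification.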

Paper Structure (28 sections, 10 equations, 10 figures, 12 tables)

This paper contains 28 sections, 10 equations, 10 figures, 12 tables.

Introduction
Related Works
Multimodal Learning
Backbone Networks
Method
Feature Extraction
Fusion
Loss Function
Exponential Moving Average (EMA)
Classification Head
Experiments and Results
Experiment Setups
Evaluation Metric
Hyperparameter Initialization
Comparison Methods
...and 13 more sections

Figures (10)

Figure 1: Four classes of the proposed methods according to the size of each module Kim2021ViLTVT.
Figure 2: Structure of the CLIP network radford2021learning.
Figure 3: Structure of the ViLT network Kim2021ViLTVT.
Figure 4: Generalization ability of ResNet-101 vs. Zero-Shot CLIP radford2021learning.
Figure 5: Structure of gMLP liu2021pay.
...and 5 more figures
