Google is all you need: Semi-Supervised Transfer Learning Strategy For Light Multimodal Multi-Task Classification Model
Haixu Liu, Penghao Jiang, Zerui Tao
TL;DR
The paper tackles multi-label image classification by integrating visual and textual information through a lightweight, three-module framework (vision, NLP, and fusion). It selects EfficientNet_b4 for images and BERT Tiny for text. The fusion module combines the two models' logits through a simple fusion pathway and a cross-attention pathway with residual connections, and training is strengthened with semi-supervised learning and data augmentation. Key contributions include a log-based class-imbalance weighting scheme, an improved label-assignment strategy that avoids empty predictions, and ablation experiments that validate the fusion techniques. The resulting system achieves a Kaggle validation score of 0.96641 and generalizes well under resource-constrained conditions, highlighting the practicality of multimodal transfer learning for scalable image labeling.
Abstract
As the volume of digital image data increases, effective image classification becomes increasingly important. This study introduces a robust multi-label classification system designed to assign multiple labels to a single image, addressing the complexity of images that may belong to several categories at once (label IDs range from 1 to 19, excluding 12). We propose a multi-modal classifier that merges advanced image recognition algorithms with Natural Language Processing (NLP) models, incorporating a fusion module to integrate these distinct modalities. Textual data is integrated to improve label prediction accuracy by providing contextual understanding that visual analysis alone cannot fully capture. Our proposed classification model combines Convolutional Neural Networks (CNNs) for image processing with NLP techniques for analyzing textual descriptions (i.e., captions). The approach includes rigorous training and validation phases, with each model component verified and analyzed through ablation experiments. Preliminary results demonstrate the classifier's accuracy and efficiency, highlighting its potential as an automatic image-labeling system.
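The log-based class weighting and the label-assignment fallback mentioned in the TL;DR can be illustrated with a minimal sketch. This section does not give the paper's exact formulas, so everything here is an assumption made for illustration: the function names `log_class_weights` and `assign_labels`, the weight formula `log(1 + total / count)`, and the 0.5 threshold are all hypothetical.

```python
import math


def log_class_weights(label_counts):
    """Hypothetical log-based weighting: rarer labels get larger weights.

    The formula log(1 + total / count) is an assumption, not the paper's.
    The logarithm damps the ratio so very rare labels are not over-weighted.
    """
    total = sum(label_counts.values())
    return {
        label: math.log(1.0 + total / count)
        for label, count in label_counts.items()
    }


def assign_labels(probs, threshold=0.5):
    """Pick every label whose probability reaches the threshold.

    If no label qualifies, fall back to the single most probable label so
    that the prediction is never empty (assumed fallback rule).
    """
    chosen = [label for label, p in probs.items() if p >= threshold]
    if not chosen:
        chosen = [max(probs, key=probs.get)]
    return sorted(chosen)


if __name__ == "__main__":
    # Label IDs follow the 1..19 (excluding 12) convention from the abstract.
    print(log_class_weights({1: 1000, 2: 10}))
    print(assign_labels({1: 0.9, 3: 0.6, 5: 0.1}))
    print(assign_labels({1: 0.2, 3: 0.3}))
```

In this sketch, the fallback guarantees at least one label per image, which matters when the evaluation metric penalizes empty predictions; the threshold would normally be tuned on validation data.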
