Multi-method Integration with Confidence-based Weighting for Zero-shot Image Classification
Siqi Yin, Lifan Jiang
TL;DR
The paper tackles zero-shot image classification by integrating multiple models and generation tools to overcome the information bottleneck for unseen categories. It generates boundary-focused reference images with ChatGPT and DALL-E, aligns images and text with CLIP, adds DINO-based image-image alignment, and fuses the resulting predictions using confidence-based weights. On CIFAR-10, CIFAR-100, and TinyImageNet, the method reports AUROC consistently above 96%, and above 99% on CIFAR-10, indicating improved generalization to open-set categories. This multi-model, boundary-aware fusion approach offers a practical path toward robust zero-shot and open-set recognition in diverse visual domains.
Abstract
This paper introduces a novel framework for zero-shot learning (ZSL), i.e., recognizing new categories that are unseen during training, based on multi-model and multi-alignment integration. Specifically, we propose three strategies to improve ZSL performance: 1) using the extensive knowledge of ChatGPT and the powerful image generation capabilities of DALL-E to create reference images that precisely describe unseen categories and classification boundaries, thereby alleviating the information bottleneck; 2) integrating the text-image and image-image alignment results from CLIP with the image-image alignment results from DINO to achieve more accurate predictions; 3) introducing an adaptive weighting mechanism based on confidence levels to aggregate the outcomes of the different prediction methods. Experimental results on multiple datasets, including CIFAR-10, CIFAR-100, and TinyImageNet, demonstrate that our model significantly improves classification accuracy over single-model approaches, achieving AUROC scores above 96% across all test datasets and surpassing 99% on CIFAR-10.
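The confidence-based aggregation in strategy 3 can be illustrated with a minimal sketch. The abstract does not state the exact weighting formula, so the version below rests on an assumption: each branch's confidence is taken to be its top softmax probability, and the fusion weights are those confidences normalized to sum to one. The function names, branch names, and similarity scores are hypothetical and serve only to show the idea.

```python
import math
from typing import Dict, List


def softmax(scores: List[float], temperature: float = 1.0) -> List[float]:
    """Convert raw similarity scores into a probability distribution."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_by_confidence(method_scores: Dict[str, List[float]]) -> List[float]:
    """Fuse per-branch class scores, weighting each branch by its confidence.

    Assumption (not taken from the paper): confidence is the top softmax
    probability of a branch, and weights are confidences normalized to sum to 1.
    """
    probs = {name: softmax(scores) for name, scores in method_scores.items()}
    confidence = {name: max(p) for name, p in probs.items()}
    total_conf = sum(confidence.values())
    weights = {name: c / total_conf for name, c in confidence.items()}
    num_classes = len(next(iter(probs.values())))
    # Weighted sum of the per-branch probabilities for every class.
    return [
        sum(weights[name] * probs[name][k] for name in probs)
        for k in range(num_classes)
    ]


# Hypothetical similarity scores over 3 classes from the three alignment branches.
scores = {
    "clip_text_image": [2.0, 0.5, 0.1],
    "clip_image_image": [1.0, 1.2, 0.3],
    "dino_image_image": [3.0, 0.2, 0.1],
}
fused = fuse_by_confidence(scores)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

In this toy input the CLIP image-image branch slightly prefers class 1, but it is also the least confident branch, so its vote is down-weighted and the fused prediction follows the two more confident branches. Other confidence measures, such as the margin between the top two probabilities or negative entropy, could be substituted without changing the overall structure.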
