Table of Contents
Fetching ...

CosmoCLIP: Generalizing Large Vision-Language Models for Astronomical Imaging

Raza Imam, Mohammed Talha Alam, Umaima Rahman, Mohsen Guizani, Fakhri Karray

TL;DR

CosmoCLIP, an astronomical image-text contrastive learning framework precisely fine-tuned on the pre-trained CLIP model using SpaceNet and BLIP-based captions, is introduced, significantly outperforming CLIP in zero-shot classification and image-text retrieval tasks.

Abstract

Existing vision-text contrastive learning models enhance representation transferability and support zero-shot prediction by matching paired image and caption embeddings while pushing unrelated pairs apart. However, astronomical image-label datasets are significantly smaller compared to general image and label datasets available from the internet. We introduce CosmoCLIP, an astronomical image-text contrastive learning framework precisely fine-tuned on the pre-trained CLIP model using SpaceNet and BLIP-based captions. SpaceNet, attained via FLARE, constitutes ~13k optimally distributed images, while BLIP acts as a rich knowledge extractor. The rich semantics derived from this SpaceNet and BLIP descriptions, when learned contrastively, enable CosmoCLIP to achieve superior generalization across various in-domain and out-of-domain tasks. Our results demonstrate that CosmoCLIP is a straightforward yet powerful framework, significantly outperforming CLIP in zero-shot classification and image-text retrieval tasks.

CosmoCLIP: Generalizing Large Vision-Language Models for Astronomical Imaging

TL;DR

CosmoCLIP, an astronomical image-text contrastive learning framework precisely fine-tuned on the pre-trained CLIP model using SpaceNet and BLIP-based captions, is introduced, significantly outperforming CLIP in zero-shot classification and image-text retrieval tasks.

Abstract

Existing vision-text contrastive learning models enhance representation transferability and support zero-shot prediction by matching paired image and caption embeddings while pushing unrelated pairs apart. However, astronomical image-label datasets are significantly smaller compared to general image and label datasets available from the internet. We introduce CosmoCLIP, an astronomical image-text contrastive learning framework precisely fine-tuned on the pre-trained CLIP model using SpaceNet and BLIP-based captions. SpaceNet, attained via FLARE, constitutes ~13k optimally distributed images, while BLIP acts as a rich knowledge extractor. The rich semantics derived from this SpaceNet and BLIP descriptions, when learned contrastively, enable CosmoCLIP to achieve superior generalization across various in-domain and out-of-domain tasks. Our results demonstrate that CosmoCLIP is a straightforward yet powerful framework, significantly outperforming CLIP in zero-shot classification and image-text retrieval tasks.
Paper Structure (16 sections, 8 equations, 7 figures, 3 tables)

This paper contains 16 sections, 8 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Overview of proposed framework: CosmoCLIP. (Left) For fine-tuning, inputs images from SpaceNet $X_{img}$ are processed via pre-trained CLIP image encoder to achieve image embeddings (1). BLIP $G_{caption}$ acts as a knowledge extractor, inputting SpaceNet images $X_{img}$ and generating descriptive captions $L_{img}$, which then act as image-text pairs for similarity training (3). (Right) Given an input text or image prompt, zero-shot prediction or image-text retrieval is performed.
  • Figure : (a) CLIP radford2021learning
  • Figure : (a) Text-to-Image Retrieval
  • Figure : (a) CLIP radford2021learning
  • Figure : (b) CosmoCLIP (Ours)
  • ...and 2 more figures