RWKV-CLIP: A Robust Vision-Language Representation Learner

Tiancheng Gu, Kaicheng Yang, Xiang An, Ziyong Feng, Dongnan Liu, Weidong Cai, Jiankang Deng

TL;DR

RWKV-CLIP is the first RWKV-driven vision-language representation learning model. It combines the effective parallel training of transformers with the efficient inference of RNNs and achieves state-of-the-art performance across multiple downstream tasks, including linear probing, zero-shot classification, and zero-shot image-text retrieval.

Abstract

Contrastive Language-Image Pre-training (CLIP) has significantly improved performance in various vision-language tasks by expanding the dataset with image-text pairs obtained from websites. This paper further explores CLIP from the perspectives of data and model architecture. To address the prevalence of noisy data and enhance the quality of large-scale image-text data crawled from the internet, we introduce a diverse description generation framework that leverages Large Language Models (LLMs) to synthesize and refine content from web-based texts, synthetic captions, and detection tags. Furthermore, we propose RWKV-CLIP, the first RWKV-driven vision-language representation learning model that combines the effective parallel training of transformers with the efficient inference of RNNs. Comprehensive experiments across various model scales and pre-training datasets demonstrate that RWKV-CLIP is a robust and efficient vision-language representation learner that achieves state-of-the-art performance in several downstream tasks, including linear probe, zero-shot classification, and zero-shot image-text retrieval. To facilitate future research, the code and pre-trained models are released at https://github.com/deepglint/RWKV-CLIP
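
The abstract does not spell out the training objective, so as a point of reference, here is a minimal sketch of the standard CLIP-style symmetric contrastive loss over a batch of paired image and text embeddings. It is illustrative only and is not taken from the RWKV-CLIP code release: the function names, the toy embeddings, and the fixed temperature of 0.07 (the value the original CLIP uses to initialize its learnable temperature) are assumptions made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each input is assumed to be a matched pair.

    temperature is a hypothetical fixed value here; CLIP learns it during training.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)

    # logits[i][j] = cosine similarity of image i and text j, scaled by 1/temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]

    # Image-to-text: each image should pick its own caption among all captions.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-image: each caption should pick its own image among all images.
    columns = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (loss_i2t + loss_t2i)


# Toy batch: correctly paired embeddings vs. the same embeddings with captions swapped.
aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

With these toy inputs the correctly paired batch yields a much lower loss than the swapped one, which is the behaviour the objective rewards: matched image-text pairs are pulled together and mismatched pairs pushed apart.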

Paper Structure (27 sections, 10 equations, 10 figures, 13 tables)

Figures (10)

  • Figure 1: The proposed RWKV-CLIP combines the effective parallel training of transformers with the efficient inference of RNNs, achieving better efficiency and accuracy than the baseline methods (e.g., CLIP and ALIP).
  • Figure 2: The architecture of our proposed diverse description generation framework.
  • Figure 3: The architecture of RWKV-CLIP, which consists of M$\times$ and N$\times$ RWKV-driven blocks followed by an average pooling layer.
  • Figure 4: Comparison of our proposed diverse description generation framework vs. CapsFusion. Hallucinations are highlighted in red, and additional semantic information is highlighted in green.
  • Figure 5: Linear probe performance comparison between RWKV-CLIP and ALIP on 26 downstream datasets. The comparisons include RWKV-CLIP-B/32 vs. ALIP-ViT-B/32 on LAION10M, RWKV-CLIP-B/16 vs. ALIP-ViT-B/16 on LAION10M, and RWKV-CLIP-B/32 vs. ALIP-ViT-B/32 on LAION30M, presented from left to right.
  • ...and 5 more figures