Fine-tuning CLIP Text Encoders with Two-step Paraphrasing

Hyunjae Kim; Seunghyun Yoon; Trung Bui; Handong Zhao; Quan Tran; Franck Dernoncourt; Jaewoo Kang

TL;DR

ParaCLIP tackles CLIP's vulnerability to paraphrase variation by introducing a two-step paraphrase generation pipeline that creates $x_T'$ and $x_T''$ from web captions and fine-tunes only the text encoder with three InfoNCE-based losses: $\mathcal{L}_{1}$, $\mathcal{L}_{2}$, and $\mathcal{L}_{3}$. The final objective, $\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{1}(\mathbf{X}_I, \mathbf{X}_T^{\prime\prime}) + \mathcal{L}_{2}(\mathbf{X}_T, \mathbf{X}_T^{\prime}) + \mathcal{L}_{3}(\mathbf{X}_T^{\prime}, \mathbf{X}_T^{\prime\prime})$, preserves pretraining knowledge while aligning paraphrase representations. Empirically, ParaCLIP improves paraphrased retrieval and semantic textual similarity across benchmarks (e.g., increases in AO@10, JS@10, and STS macro-average) and exhibits strong synergy with RoBERTa initializations and LaCLIP-style paraphrase augmentation, though it can slightly degrade some standard vision-language tasks, partly due to batch-size sensitivity. The work demonstrates an efficient, targeted fine-tuning approach that improves robustness to linguistic variability, which matters for real-world, query-driven vision-language systems.
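To make the shape of the objective concrete, the following is a minimal NumPy sketch of a sum of three InfoNCE terms over the pairs named above. It is an illustration under stated assumptions, not the authors' implementation: the symmetric (two-direction) form of InfoNCE, the temperature value of 0.07, and the function names `info_nce` and `total_loss` are all choices made here for the example. In actual training the embeddings would come from the frozen image encoder and the trainable text encoder.

```python
import numpy as np


def _log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of embeddings.

    Row i of `a` and row i of `b` form the positive pair; every other
    row in the batch acts as a negative. The symmetric form and the
    temperature are assumptions made for this sketch.
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    loss_ab = -_log_softmax(logits)[idx, idx].mean()
    loss_ba = -_log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_ab + loss_ba)


def total_loss(x_i, x_t, x_t1, x_t2, temperature=0.07):
    """Sum of the three terms from the TL;DR.

    x_i  : image embeddings            (X_I)
    x_t  : original caption embeddings (X_T)
    x_t1 : first paraphrase embeddings  (X_T')
    x_t2 : second paraphrase embeddings (X_T'')
    """
    return (
        info_nce(x_i, x_t2, temperature)     # L_1(X_I, X_T'')
        + info_nce(x_t, x_t1, temperature)   # L_2(X_T, X_T')
        + info_nce(x_t1, x_t2, temperature)  # L_3(X_T', X_T'')
    )
```

Each term pulls one pair of views together: images with second-step paraphrases, captions with first-step paraphrases, and the two paraphrases with each other.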

Abstract

Contrastive language-image pre-training (CLIP) models have demonstrated considerable success across various vision-language tasks, such as text-to-image retrieval, where the model is required to effectively process natural language input to produce an accurate visual output. However, current models still face limitations in dealing with linguistic variations in input queries, such as paraphrases, making it challenging to handle a broad range of user queries in real-world applications. In this study, we introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases. Our approach involves a two-step paraphrase generation process, where we automatically create two categories of paraphrases from web-scale image captions by leveraging large language models. Subsequently, we fine-tune the CLIP text encoder using these generated paraphrases while freezing the image encoder. Our resulting model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks, including paraphrased retrieval (with rank similarity scores improved by up to 2.0% and 5.6%), Visual Genome Relation and Attribution, as well as seven semantic textual similarity tasks.
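The rank similarity scores mentioned above compare the top-k images retrieved for a query with those retrieved for its paraphrase. As a rough illustration, the sketch below computes two such scores, assuming the common definitions of Jaccard similarity at k and average overlap at k (which the TL;DR abbreviates as JS@10 and AO@10). The function names and the handling of empty rankings are choices made here, and the paper's exact definitions may differ.

```python
def jaccard_at_k(rank_a, rank_b, k=10):
    """Jaccard similarity of the top-k item sets of two rankings."""
    top_a, top_b = set(rank_a[:k]), set(rank_b[:k])
    union = top_a | top_b
    # Two empty rankings are treated as identical (an assumption).
    return len(top_a & top_b) / len(union) if union else 1.0


def average_overlap_at_k(rank_a, rank_b, k=10):
    """Mean, over depths 1..k, of the overlap ratio of the two prefixes.

    Unlike the Jaccard score, this rewards agreement near the top
    of the rankings more than agreement further down.
    """
    depth = min(k, len(rank_a), len(rank_b))
    if depth == 0:
        return 1.0  # same convention as above for empty rankings
    total = 0.0
    for d in range(1, depth + 1):
        total += len(set(rank_a[:d]) & set(rank_b[:d])) / d
    return total / depth


# Hypothetical image IDs retrieved for a query and for its paraphrase.
original = ["img3", "img7", "img1"]
paraphrase = ["img7", "img3", "img1"]
js = jaccard_at_k(original, paraphrase, k=3)
ao = average_overlap_at_k(original, paraphrase, k=3)
```

In this toy case the two rankings contain the same items, so the Jaccard score is maximal, while the swapped top position lowers the average overlap.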

Paper Structure (22 sections, 2 equations, 3 figures, 2 tables)

Introduction
Method
Paraphrase Generation
Caption-to-paraphrase generation
Paraphrase-to-paraphrase generation
Training Objectives
Experimental Setups
Baseline Models
Evaluation
Results and Discussion
Main Results
Effect of fine-tuning using paraphrases
Effect of initialization with RoBERTa
Comparison with LaCLIP
Lack of compositional understanding
...and 7 more sections

Figures (3)

Figure 1: Image retrieval results of CLIP (Radford et al., 2021) for two different queries (the gold image is denoted by a bold border). Despite their comparable meanings, the model yields dissimilar retrieval results, highlighting the model's struggle with linguistic variations.
Figure 2: Overview of our two-step paraphrasing process. (1) In caption-to-paraphrase generation, the first paraphrase is generated by removing noise from the original caption and converting it into plainer language. (2) In paraphrase-to-paraphrase generation, the second paraphrase is generated from the first paraphrase, where the word "reversible" is changed to the semantically similar expression "can be flipped over."
Figure 3: Examples of images retrieved by CLIP (Radford et al., 2021) and our ParaCLIP models for two different queries. Note that the queries are obtained from the paraphrased retrieval dataset, and query B is a paraphrase of query A. The gold images are denoted by a bold border.
