EFSA: Episodic Few-Shot Adaptation for Text-to-Image Retrieval

Muhammad Huzaifa, Yova Kementchedjhieva

TL;DR

This work tackles open-domain text-to-image retrieval, where zero-shot models and single-domain fine-tuning struggle with hard negatives. It introduces EFSA, a test-time episodic adaptation framework that, for each query, fine-tunes a vision-language model with LoRA on the top-$k$ retrieved candidates and their synthetic captions, re-ranks those candidates, and resets the model before the next query. Across queries from eight diverse domains and a large open-domain pool, EFSA consistently improves Recall@1 and is robust to domain shifts, outperforming strong baselines including fine-tuning and RLCF. The approach offers a practical, compute-friendly path to robust cross-domain T2I retrieval with modest storage overhead and strong generalization, highlighting the value of episodic few-shot adaptation for open-domain multimodal tasks.

Abstract

Text-to-image retrieval is a critical task for managing diverse visual content, but common benchmarks for the task rely on small, single-domain datasets that fail to capture real-world complexity. Pre-trained vision-language models tend to perform well with easy negatives but struggle with hard negatives--visually similar yet incorrect images--especially in open-domain scenarios. To address this, we introduce Episodic Few-Shot Adaptation (EFSA), a novel test-time framework that adapts pre-trained models dynamically to a query's domain by fine-tuning on top-k retrieved candidates and synthetic captions generated for them. EFSA improves performance across diverse domains while preserving generalization, as shown in evaluations on queries from eight highly distinct visual domains and an open-domain retrieval pool of over one million images. Our work highlights the potential of episodic few-shot adaptation to enhance robustness in the critical and understudied task of open-domain text-to-image retrieval.
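
The episode described in the abstract (retrieve the top-k candidates, adapt on them and their synthetic captions, re-rank, then discard the adaptation) can be sketched on toy data. This is a minimal illustration under stated assumptions, not the authors' implementation: it works on precomputed, L2-normalized embeddings, replaces LoRA inside the encoders with a single low-rank bilinear adapter `I + B @ A`, and uses a symmetric contrastive loss with hand-written gradients. The names `efsa_episode`, `pool_vecs`, and `caption_vecs` (embeddings of the synthetic captions, one per pool image) are hypothetical.

```python
import numpy as np

def softmax_rows(scores):
    """Numerically stable softmax over each row."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def efsa_episode(query_vec, pool_vecs, caption_vecs, k=5, rank=2,
                 steps=30, lr=0.5, seed=0):
    """Toy episode: retrieve top-k, adapt a low-rank adapter, re-rank.

    Assumes all vectors are L2-normalized. The adapter (A, B) is local to
    this call, so it is discarded afterwards, mimicking the per-query reset.
    """
    d = pool_vecs.shape[1]

    # 1) Zero-shot retrieval: keep the k pool images most similar to the query.
    top_k = np.argsort(-(pool_vecs @ query_vec))[:k]
    images = pool_vecs[top_k]        # (k, d) copies, the pool is not modified
    captions = caption_vecs[top_k]   # (k, d) stand-in for synthetic captions

    # 2) Fresh low-rank adapter; B starts at zero, so M starts as the identity.
    rng = np.random.default_rng(seed)
    A = rng.normal(scale=0.1, size=(rank, d))
    B = np.zeros((d, rank))
    eye = np.eye(k)

    for _ in range(steps):
        M = np.eye(d) + B @ A
        S = captions @ M @ images.T  # S[i, j]: caption i scored against image j
        # Gradient of the averaged caption->image and image->caption
        # cross-entropy losses with respect to S.
        G = (softmax_rows(S) - eye + (softmax_rows(S.T) - eye).T) / (2 * k)
        dM = captions.T @ G @ images
        grad_B, grad_A = dM @ A.T, B.T @ dM
        B -= lr * grad_B
        A -= lr * grad_A

    # 3) Re-rank only the top-k candidates with the adapted similarity q^T M x.
    M = np.eye(d) + B @ A
    adapted_scores = images @ M.T @ query_vec
    reranked = top_k[np.argsort(-adapted_scores)]
    return top_k, reranked

# Usage on random toy embeddings (purely illustrative data).
rng = np.random.default_rng(1)
pool = rng.normal(size=(20, 8))
pool /= np.linalg.norm(pool, axis=1, keepdims=True)
captions = pool + 0.1 * rng.normal(size=pool.shape)
captions /= np.linalg.norm(captions, axis=1, keepdims=True)
zero_shot, reranked = efsa_episode(captions[3], pool, captions)
print("zero-shot order:", zero_shot)
print("re-ranked order:", reranked)
```

Because the adapter lives only inside the function call, every query starts again from the unadapted similarity, which is the property the episodic setup relies on to preserve generalization across domains.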

Paper Structure

This paper contains 29 sections, 6 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Zero-shot text-to-image retrieval with CLIP exhibits a sharp drop in Recall@1 compared to Recall@5 and 10. For the three example queries on top, CLIP ranks an incorrect image (red frame) as the highest, which is highly similar to the ground truth image (green frame). Recall@1 suffers due to such hard negatives.
  • Figure 2: Our method, Episodic Few-Shot Adaptation, first retrieves the top-$k$ most similar images from a diverse, open-domain image pool. It then fine-tunes both the image and text encoders on these top-$k$ images and on synthetic captions generated for them. Finally, the updated encoders are used to re-rank the top-$k$ images, bringing more correct candidates to the high ranks.
  • Figure 3: Qualitative comparison of EFSA and zero-shot CLIP. Ground-truth images (highlighted in green) correspond to the text queries shown above each row. EFSA successfully re-ranks the ground-truth images to the first rank. See the supplementary figures for more detailed examples with generated captions.
  • Figure 4: Comparison of LoRA parameter tuning versus tuning all model parameters across multiple datasets.
  • Figure 6: Effects of various caption generation prompts.
  • ...and 6 more figures
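
The Recall@K metric referred to in Figure 1 can be made concrete with a small sketch. The helper `recall_at_k` and the toy rankings below are illustrative assumptions, not taken from the paper; they show how a single hard negative ranked above the ground truth lowers Recall@1 while leaving Recall@5 or Recall@10 untouched.

```python
def recall_at_k(rankings, ground_truth, k):
    """Fraction of queries whose ground-truth image appears in the top k.

    rankings: one list of image ids per query, best match first.
    ground_truth: the correct image id for each query.
    """
    hits = sum(1 for ranked, gt in zip(rankings, ground_truth) if gt in ranked[:k])
    return hits / len(ground_truth)

# Toy example: three queries, each with a three-image ranking.
rankings = [[3, 1, 2], [0, 4, 5], [7, 8, 9]]
ground_truth = [1, 0, 9]

for k in (1, 2, 3):
    print(f"Recall@{k} = {recall_at_k(rankings, ground_truth, k):.2f}")
```

In this toy data the first query's correct image sits at rank 2 behind one distractor, so it counts as a miss for Recall@1 but as a hit for any larger K.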