Ranking-aware adapter for text-driven image ordering with CLIP

Wei-Hsiang Yu, Yen-Yu Lin, Ming-Hsuan Yang, Yi-Hsuan Tsai

TL;DR

This work tackles text-guided ranking of multiple images by reframing CLIP as a learning-to-rank task and introducing a lightweight ranking-aware adapter. A cross-attention-based adapter generates text-conditioned visual embeddings and a relational attention module learns pairwise visual differences to predict ranking scores. The approach combines a regression objective with a pairwise ranking loss and demonstrates strong results across facial age estimation, historical dating, image quality, and object counting without task-specific pretraining. The method offers a general and scalable solution for multi-image ranking using a single, compact extension to CLIP, with potential impact on retrieval and QA systems that require quantitative comparisons. Overall, the framework highlights the feasibility and value of integrating vision-language models with ranking objectives for flexible, text-driven image ordering.
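The combination of a regression objective with a pairwise ranking loss can be illustrated with a minimal sketch. The summary above does not give the exact loss forms, so the choices here are assumptions: mean squared error for regression and a RankNet-style logistic loss over pairs. The function names and the `ranking_weight` parameter are hypothetical and are not taken from the authors' code.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def regression_loss(pred, target):
    """Mean squared error between predicted and ground-truth scores (assumed form)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def pairwise_ranking_loss(pred, target):
    """RankNet-style logistic loss, averaged over pairs whose targets differ (assumed form)."""
    total, count = 0.0, 0
    for i in range(len(pred)):
        for j in range(len(pred)):
            if target[i] > target[j]:
                # Small when image i is scored above image j, large otherwise.
                total += softplus(-(pred[i] - pred[j]))
                count += 1
    return total / count if count else 0.0


def combined_loss(pred, target, ranking_weight=1.0):
    """Regression term plus a weighted pairwise ranking term."""
    return regression_loss(pred, target) + ranking_weight * pairwise_ranking_loss(pred, target)
```

In this sketch the regression term anchors each score to its label, while the pairwise term depends only on score differences and so penalizes images that are placed in the wrong relative order.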

Abstract

Recent vision-language models (VLMs) have made significant progress on downstream tasks that require quantitative concepts, such as facial age estimation and image quality assessment, opening up applications like image ranking and retrieval. However, existing studies typically reason over a single image and depend heavily on text prompting, which limits their ability to learn a comprehensive understanding from multiple images. To address this, we propose an effective yet efficient approach that reframes CLIP training as a learning-to-rank task and introduces a lightweight adapter to augment CLIP for text-guided image ranking. Specifically, our approach incorporates learnable prompts to adapt to new instructions for ranking purposes and an auxiliary branch with ranking-aware attention, which leverages text-conditioned visual differences as additional supervision for image ranking. Our ranking-aware adapter consistently outperforms fine-tuned CLIPs on various tasks and achieves competitive results compared to state-of-the-art models designed for specific tasks like facial age estimation and image quality assessment. Overall, our approach focuses on ranking images with a single instruction, which provides a natural and generalized way of learning from visual differences across images and bypasses the need for extensive text prompts tailored to individual tasks. Code is available at github.com/uynaes/RankingAwareCLIP.

Paper Structure

This paper contains 41 sections, 6 equations, 12 figures, 12 tables.

Figures (12)

  • Figure 1: Comparisons with prior work. (a) Prior works require generating caption combinations covering numbers of bins from $N_1, N_2, ..., N_{i}$ with the task-related target, such as "cat," paired with each image (e.g., $I_1, I_2$). In addition, text modulation is necessary to map these numerical values into an ordinal latent space for contrastive learning. (b) Our method streamlines the ranking process through a learning-to-rank framework. A pre-trained CLIP model encodes images and a single rank-related prompt. A lightweight, ranking-aware adapter then generates text-conditioned visual embedding pairs and their relational differences. By optimizing the relational differences among pairs, our approach learns the visual relevance to the given text query.
  • Figure 2: Framework overview. We encode the given images and the query caption using the pre-trained CLIP model. The proposed ranking adapter compiles text-conditioned visual embedding $\{z'_{i}\}$ via the transformer cross-attention mechanism, where patch embeddings $\{z_{i}\}$ serve as queries and text embeddings $\{w_{i}\}$ act as key-value pairs. The ranking adapter comprises two heads: the regression and ranking heads. The former predicts image ranking scores using features of individual images. The latter employs a ranking-aware attention mechanism to explore relative feature responses across images upon which these images are ranked.
  • Figure 3: Relational ranking-aware attention. First, the text-conditioned visual embeddings of two images $\{z'_{i}\}$ and $\{z'_{j}\}$ are concatenated to form a pairwise embedding $\{z'_{ij}\}$. This pairwise embedding is used as the key in the attention mechanism, with relational tokens $\{q\}$ serving as the query. The resulting attention matrix $A$ is split into two parts, $\{A_{i}\}$ and $\{A_{j}\}$, corresponding to the attention assigned to each image. Using these matrices, the attention outputs $\{O_{i}\}$ and $\{O_{j}\}$ are computed with $\{z'_{i}\}$ and $\{z'_{j}\}$ as values. Finally, the pair's difference is computed through subtraction, averaged over relational tokens, and processed by an FFN to generate the pairwise difference $O_{i,j}$.
  • Figure 4: Qualitative examples of our model. We visualize the ranking performance on facial age estimation, dating historical images, and image quality and aesthetics assessment.
  • Figure 5: Qualitative examples on object count sorting and IQA/IAA. The images are sorted from highest to lowest score according to the textual cues. A red cross ($\times$) marks an image placed at the wrong position in the list. AI artworks are generated by DALL·E 3 (Betker et al.). Our method accurately ranks images from simple to complex compositions with multiple object categories in real photos and artworks. For image sorting based on quality properties, our method outperforms fine-tuned CLIP-IQA (CLIP-IQA+).
  • ...and 7 more figures
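
The attention steps described in the captions of Figures 2 and 3 can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses a single attention head, omits all learned query/key/value projections, residual connections, and normalization layers, and assumes the pairwise concatenation is along the token axis (which is what allows $A$ to be split into $A_i$ and $A_j$). The function names, the `make_ffn` helper, and all shapes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Softmax with max-subtraction for numerical stability."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def text_conditioned(z, w):
    """Figure 2 step: patch embeddings z (P, d) act as queries,
    text embeddings w (T, d) act as keys and values."""
    d = z.shape[-1]
    attn = softmax(z @ w.T / np.sqrt(d))  # (P, T)
    return attn @ w                        # (P, d), the z' embeddings


def relational_difference(z_i, z_j, q, ffn):
    """Figure 3 step: relational tokens q (R, d) attend over the
    concatenated pair, and the per-image outputs are subtracted."""
    d = q.shape[-1]
    p = z_i.shape[0]
    z_ij = np.concatenate([z_i, z_j], axis=0)  # (2P, d), used as the key
    A = softmax(q @ z_ij.T / np.sqrt(d))       # (R, 2P), normalized over both images
    A_i, A_j = A[:, :p], A[:, p:]              # attention assigned to each image
    O_i, O_j = A_i @ z_i, A_j @ z_j            # (R, d) each, with z' as values
    diff = (O_i - O_j).mean(axis=0)            # average over relational tokens
    return ffn(diff)                           # pairwise difference O_{i,j}


def make_ffn(rng, d, hidden):
    """Hypothetical two-layer ReLU MLP with random weights, for illustration only."""
    w1 = rng.normal(size=(d, hidden)) / np.sqrt(d)
    w2 = rng.normal(size=(hidden, d)) / np.sqrt(hidden)
    return lambda x: np.maximum(x @ w1, 0.0) @ w2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, patches, text_tokens, rel_tokens = 8, 4, 3, 2
    w = rng.normal(size=(text_tokens, d))
    q = rng.normal(size=(rel_tokens, d))
    z_i = text_conditioned(rng.normal(size=(patches, d)), w)
    z_j = text_conditioned(rng.normal(size=(patches, d)), w)
    o_ij = relational_difference(z_i, z_j, q, make_ffn(rng, d, 16))
    print(o_ij.shape)
```

Because the softmax is taken jointly over the tokens of both images, swapping the two images simply swaps $O_i$ and $O_j$ in this sketch, so the pre-FFN difference flips sign. That antisymmetry is what makes the difference usable as a relative-order signal for a pair.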