
UniRank: End-to-End Domain-Specific Reranking of Hybrid Text-Image Candidates

Yupei Yang, Lin Yang, Wanxi Deng, Lin Qu, Shikui Tu, Lei Xu

Abstract

Reranking is a critical component in many information retrieval pipelines. Despite remarkable progress in text-only settings, multimodal reranking remains challenging, particularly when the candidate set contains hybrid text and image items. A key difficulty is the modality gap: a text reranker is intrinsically closer to text candidates than to image candidates, leading to biased and suboptimal cross-modal ranking. Vision-language models (VLMs) mitigate this gap through strong cross-modal alignment and have recently been adopted to build multimodal rerankers. However, most VLM-based rerankers encode all candidates as images, and treating text as images introduces substantial computational overhead. Meanwhile, existing open-source multimodal rerankers are typically trained on general-domain data and often underperform in domain-specific scenarios. To address these limitations, we propose UniRank, a VLM-based reranking framework that natively scores and orders hybrid text-image candidates without any modality conversion. Building on this hybrid scoring interface, UniRank provides an end-to-end domain adaptation pipeline that includes: (1) an instruction-tuning stage that learns calibrated cross-modal relevance scoring by mapping label-token likelihoods to a unified scalar score; and (2) a hard-negative-driven preference alignment stage that constructs in-domain pairwise preferences and performs query-level policy optimization through reinforcement learning from human feedback (RLHF). Extensive experiments on scientific literature retrieval and design patent search demonstrate that UniRank consistently outperforms state-of-the-art baselines, improving Recall@1 by 8.9% and 7.3%, respectively.
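The abstract's stage (1) can be made concrete with a small example. The snippet below is a minimal sketch of one common way to "map label-token likelihoods to a unified scalar score", not the paper's actual implementation: it assumes the reranker prompts the VLM for a binary "yes"/"no" relevance label and renormalizes the two label-token logits; the names `relevance_score`, `yes_id`, and `no_id` are hypothetical.

```python
import torch

def relevance_score(next_token_logits: torch.Tensor,
                    yes_id: int, no_id: int) -> float:
    """Map label-token likelihoods to a unified scalar score in [0, 1].

    `next_token_logits` is the logit vector the VLM produces at the position
    where the relevance label would be generated, after reading the query and
    one (text or image) candidate; `yes_id`/`no_id` are the tokenizer ids of
    the two label tokens. The score is the renormalized probability of "yes".
    """
    pair = torch.stack([next_token_logits[yes_id],
                        next_token_logits[no_id]])
    return torch.softmax(pair, dim=0)[0].item()
```

Because the same scalar is produced whether the candidate entered through the text or the vision channel, hybrid candidates can then be ordered by a single sort over this score, with no modality conversion.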


Paper Structure

This paper contains 68 sections, 24 equations, 5 figures, and 10 tables.

Figures (5)

  • Figure 1: Overview of UniRank. We first perform instruction-driven SFT to obtain a VLM-based reranker. We then mine hard negatives to construct pairwise preferences, train a reward model, and further align the reranker via RLHF for domain-specific hybrid text-image reranking. (A sketch of this preference-and-reward step follows the list.)
  • Figure 2: Example from MMDocIR Academic Paper (layout-retrieval). Given a text-only question, the retriever/reranker selects the most relevant layout regions (e.g., paragraph/figure/table) within a long scientific document.
  • Figure 3: Example from the design patent search task. Given a target product as the query (image + text), the reranker scores and orders multiple recalled design patents (candidates) by infringement risk.
  • Figure 4: Illustrative SFT and preference-data examples for scientific literature retrieval. The user message contains a question and one candidate layout region (text/table/figure). Images are provided via the VLM vision input channel.
  • Figure 5: Illustrative SFT and preference-data examples for design patent search. The user message contains one target product and one candidate patent with both images and structured text fields.
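Figure 1 outlines the preference-alignment stage without giving formulas, so the following is a hedged illustration rather than UniRank's exact recipe: it assumes preferences are (query, chosen, rejected) triples built from annotated positives and mined hard negatives, and that the reward model is trained with the standard Bradley-Terry objective used in RLHF. Both function names are hypothetical.

```python
import torch
import torch.nn.functional as F

def build_preference_pairs(query, positives, hard_negatives):
    """In-domain pairwise preferences: each triple states that a relevant
    candidate should be ranked above a mined hard negative for this query."""
    return [(query, pos, neg) for pos in positives for neg in hard_negatives]

def reward_model_loss(chosen_rewards: torch.Tensor,
                      rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry objective: -log sigmoid(r_chosen - r_rejected), averaged
    over a batch of preference pairs. Minimizing it pushes the reward model to
    score preferred candidates above their hard negatives."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Per Figure 1, the reranker would then be optimized against this learned reward at the query level with an RLHF policy-optimization loop.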