NeighborRetr: Balancing Hub Centrality in Cross-Modal Retrieval

Zengrong Lin, Zheng Wang, Tianwen Qian, Pan Mu, Sixian Chan, Cong Bai

TL;DR

Hubness remains a challenge in cross-modal retrieval, biasing nearest-neighbor relations even for strong alignment models. NeighborRetr mitigates hubness during training by estimating sample centrality, weighting hub learning, balancing neighborhood relations, and enforcing uniform retrieval, all integrated within a two-level visual-text learning pipeline. It introduces three losses (centrality weighting, neighbor adjusting, and uniform regularization) along with a KL term for stability, achieving state-of-the-art results on four text-video and three text-image benchmarks and demonstrating robust cross-domain generalization. The work provides empirical evidence that training-time hubness mitigation improves both accuracy and fairness in cross-modal retrieval, and it releases code for reproducibility.

Abstract

Cross-modal retrieval aims to bridge the semantic gap between different modalities, such as visual and textual data, enabling accurate retrieval across them. Despite significant advancements with models like CLIP that align cross-modal representations, a persistent challenge remains: the hubness problem, where a small subset of samples (hubs) dominates as nearest neighbors, leading to biased representations and degraded retrieval accuracy. Existing methods often mitigate hubness through post-hoc normalization techniques, relying on prior data distributions that may not be practical in real-world scenarios. In this paper, we directly mitigate hubness during training and introduce NeighborRetr, a novel method that effectively balances the learning of hubs and adaptively adjusts the relations of various kinds of neighbors. Our approach not only mitigates the hubness problem but also enhances retrieval performance, achieving state-of-the-art results on multiple cross-modal retrieval benchmarks. Furthermore, NeighborRetr demonstrates robust generalization to new domains with substantial distribution shifts, highlighting its effectiveness in real-world applications. We make our code publicly available at: https://github.com/zzezze/NeighborRetr .
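
To make the notion of a hub concrete, the following is a minimal sketch of the k-occurrence count, i.e. how often each gallery item appears among the top-k nearest neighbors of the queries. It is an illustration only, not the authors' implementation: the function name `k_occurrence` and the toy similarity matrix `sim` are assumptions introduced here.

```python
def k_occurrence(sim, k):
    """Count how often each gallery item lands in some query's top-k.

    sim[i][j] is the similarity between query i and gallery item j
    (a hypothetical precomputed matrix, e.g. text-to-image cosine similarity).
    """
    n_gallery = len(sim[0])
    counts = [0] * n_gallery
    for row in sim:
        # Indices of the k most similar gallery items for this query.
        top_k = sorted(range(n_gallery), key=lambda j: row[j], reverse=True)[:k]
        for j in top_k:
            counts[j] += 1
    return counts


# Toy data (made up): 4 queries, 3 gallery items.
# Item 2 is fairly similar to every query, so it behaves like a hub.
sim = [
    [0.9, 0.1, 0.8],
    [0.2, 0.7, 0.8],
    [0.3, 0.2, 0.9],
    [0.6, 0.1, 0.7],
]
counts = k_occurrence(sim, k=1)
print(counts)  # item 2 is the nearest neighbor of 3 of the 4 queries
```

A strongly skewed count vector, where a few items collect most of the top-k slots, is the symptom the abstract refers to as hubness.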

Paper Structure

This paper contains 29 sections, 28 equations, 7 figures, 13 tables.

Figures (7)

  • Figure 1: Left: Hubness Balancing. In the original embedding space (top left), bad hubs such as $g_3$ dominate the neighborhood, leading to unsatisfactory retrieval performance. NeighborRetr rebalances the neighborhood relations (bottom left) by adaptively bringing good hubs, such as $g_2$, closer, effectively mitigating the hubness problem. Right: Good/Bad Neighbors Identification. NeighborRetr identifies cross-modality hubs based on sample centrality with a memory bank (MB), and distinguishes between good neighbors (GN) and bad neighbors (BN) by considering both the neighbor similarity and the sample centrality.
  • Figure 2: Distribution of k-occurrence frequency among all textual similarities for (a) vanilla CLIP and (b) CLIP fine-tuned with NeighborRetr. The x-axis represents how often an image appears in queries’ top-15 nearest neighbors. The y-axis counts the number of images at each frequency. Blue bars indicate all neighbors, orange bars highlight positive pairs among neighbors, and the green line represents the ground truth distribution.
  • Figure 3: Overview of NeighborRetr on a two-level hierarchical framework. At the low level, $\mathcal{L}_{\mathrm{Wti}}$ emphasizes the centrality of hubs and $\mathcal{L}_{\mathrm{Nbi}}$ balances neighborhood relations by comparing against a memory bank. After token merging, $\mathcal{L}_{\mathrm{Opt}}$ enforces equal retrieval probabilities for overall matching at the high level.
  • Figure 4: Top: Distributions of cross-modal similarity of MSR-VTT and ActivityNet. Bottom: Cross-domain adaptation on MSR-VTT$\rightarrow$ActivityNet. The radius denotes hub occurrence (Radovanović et al., 2010).
  • Figure 5: Retrieval performance on MSR-VTT: The y-axis shows the geometric mean of R@{1,5,10}. Left: Ablation study on the size of neighbors; the x-axis represents the size of neighbors. Right: Ablation study on the size of the Memory Bank during training; the x-axis represents the Memory Bank size.
  • ...and 2 more figures
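
The Figure 5 caption reports the geometric mean of R@{1,5,10}. As a rough sketch of how such a summary score can be computed, the snippet below derives Recall@K from the rank of each query's ground-truth item and then takes the geometric mean over the three cutoffs. The function names, the percentage scale, and the `ranks` list are assumptions made for illustration; they are not taken from the paper or its code.

```python
import math


def recall_at_k(ranks, k):
    """Percentage of queries whose ground-truth item is ranked within the top k.

    ranks holds the 1-based rank of the ground-truth item for each query.
    """
    return 100.0 * sum(r <= k for r in ranks) / len(ranks)


def geometric_mean_recall(ranks, ks=(1, 5, 10)):
    """Geometric mean of Recall@K over the given cutoffs."""
    recalls = [recall_at_k(ranks, k) for k in ks]
    return math.prod(recalls) ** (1.0 / len(recalls))


# Made-up ranks for 10 queries (not results from the paper).
ranks = [1, 3, 12, 1, 7, 2, 1, 25, 4, 9]
r1, r5, r10 = (recall_at_k(ranks, k) for k in (1, 5, 10))
gm = geometric_mean_recall(ranks)
print(r1, r5, r10, round(gm, 2))
```

Unlike an arithmetic mean, the geometric mean drops sharply when any single cutoff is weak, so it rewards balanced gains across R@1, R@5, and R@10.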