HarmoCLIP: Harmonizing Global and Regional Representations in Contrastive Vision-Language Models

Haoxi Zeng, Haoxuan Li, Yi Bin, Pengpeng Zeng, Xing Xu, Yang Yang, Heng Tao Shen

TL;DR

HarmoCLIP addresses the persistent global–local trade-off in CLIP by introducing direct region-to-text supervision and a Text Token Space, enabling Lexeme-Region Contrastive Learning and Global–Region Alignment. The method preserves the strong global image–text alignment of CLIP while substantially improving region-level semantics, achieving state-of-the-art retrieval with balanced performance and data efficiency. Key contributions include a three-loss framework (GC, LRC, GR), a novel region-to-text supervision pathway, and thorough ablations and visualizations demonstrating improved global coherence and fine-grained perception. The approach is plug-and-play with existing CLIP-like architectures and shows strong potential for downstream tasks requiring open-vocabulary region understanding and robust cross-modal retrieval.

Abstract

Contrastive Language-Image Pre-training (CLIP) has demonstrated remarkable generalization ability and strong performance across a wide range of vision-language tasks. However, due to the lack of region-level supervision, CLIP exhibits limited fine-grained semantic understanding. Although several methods attempt to mitigate this issue, they unintentionally disrupt the global alignment, resulting in a persistent trade-off in which improving local perception degrades global coherence. In this paper, we propose HarmoCLIP, a novel framework designed to harmonize global and region representations within CLIP. We first identify the absence of direct alignment between local textual and visual semantics as the fundamental cause of the trade-off. To address this, HarmoCLIP introduces an explicit fine-grained semantic supervision term that directly aligns textual segments with their corresponding visual regions, effectively bridging the image region space and the textual space. To further strengthen representation capability at the local level, our method introduces a novel Region-Language Alignment supervision strategy that promotes fine-grained semantic learning without compromising global semantic consistency. Extensive experiments demonstrate that HarmoCLIP achieves state-of-the-art performance on the global retrieval task (with improvements of up to 69.78%) and yields a substantial 3.2% improvement in Top-1 accuracy on the region-level task of bounding-box classification, consistently outperforming prior approaches while providing a balanced, efficient, and plug-and-play solution to the global-local trade-off in CLIP. Code is available at https://github.com/Erosist/HarmoCLIP.
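
To make the idea of aligning textual segments with visual regions concrete, the following is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss over paired region and text-segment embeddings. It is not the authors' implementation: the function names, the temperature value of 0.07, and the assumption that row i of each input forms a matched pair are all illustrative choices, and the actual HarmoCLIP losses may differ in form.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the target for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(region_feats, text_feats, temperature=0.07):
    """Hypothetical region-to-text contrastive loss.

    region_feats[i] and text_feats[i] are assumed to describe the same
    region; every other pairing in the batch is treated as a negative.
    """
    regions = [l2_normalize(v) for v in region_feats]
    texts = [l2_normalize(v) for v in text_feats]
    # Cosine similarity between every region and every text segment.
    logits = [
        [sum(a * b for a, b in zip(r, t)) / temperature for t in texts]
        for r in regions
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    # Average the region-to-text and text-to-region directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_transposed))


if __name__ == "__main__":
    regions = [[1.0, 0.0], [0.0, 1.0]]
    matched = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print("matched pairs:", symmetric_contrastive_loss(regions, matched))
    print("swapped pairs:", symmetric_contrastive_loss(regions, swapped))
```

In this toy setup the loss is close to zero when each region is paired with its own text embedding and large when the pairs are swapped, which is the behavior a direct region-to-text supervision signal relies on.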

Paper Structure

This paper contains 23 sections, 9 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Comparison of existing methods on global-awareness and region-level tasks. Panel (a) shows the trade-off between global and region-level understanding across models. Panel (b) presents the relationship between the $I_{\text{R}}$-$I_{\text{G}}$ and $I_{\text{R}}$-$T_{\text{G}}$ similarities. Retrieval performance is reported as the mean of I$\rightarrow$T@1 and T$\rightarrow$I@1 on MSCOCO, and BBox performance is measured by Top-1 accuracy on OVCOCO.
  • Figure 2: An overview of the crucial semantic spaces in CLIP. Blue arrows and annotations indicate the path taken by HarmoCLIP, while red annotations show the limitations of current methods.
  • Figure 3: $I_{\text{G}}$–$I_{\text{R}}$ vs. $I_{\text{R}}$–$T_{\text{G}}$ concordance matrix across models, reflecting the strong correlation between the two alignment processes.
  • Figure 4: Overall architecture of HarmoCLIP. It is trained with three loss functions: $\mathcal{L}_{\mathrm{GC}}$ (Global Contrastive Learning), $\mathcal{L}_{\mathrm{LRC}}$ (Lexeme–Region Contrastive Learning), and $\mathcal{L}_{\mathrm{GR}}$ (Global–Region Alignment).
  • Figure S1: Data samples in detail.
  • ...and 2 more figures
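
The retrieval score described in the Figure 1 caption (the mean of I→T@1 and T→I@1) can be sketched as follows. This is an illustrative simplification, not the paper's evaluation code: it assumes a square similarity matrix in which image i is paired with exactly one caption i, whereas MSCOCO provides several captions per image, and the function names are hypothetical.

```python
def recall_at_1(sim):
    """Fraction of rows whose highest-scoring column is the matching index."""
    hits = 0
    for i, row in enumerate(sim):
        best = max(range(len(row)), key=lambda j: row[j])
        if best == i:
            hits += 1
    return hits / len(sim)


def mean_retrieval_r1(sim):
    """Average of image-to-text and text-to-image Recall@1.

    sim[i][j] is the similarity between image i and caption j.
    """
    image_to_text = recall_at_1(sim)
    # Transposing the matrix turns captions into queries and images into candidates.
    text_to_image = recall_at_1([list(col) for col in zip(*sim)])
    return 0.5 * (image_to_text + text_to_image)


if __name__ == "__main__":
    similarity = [
        [0.9, 0.1, 0.2],
        [0.8, 0.3, 0.1],  # image 1 is wrongly matched to caption 0
        [0.2, 0.1, 0.7],
    ]
    # I->T@1 is 2/3 and T->I@1 is 3/3 for this matrix.
    print(mean_retrieval_r1(similarity))
```

Reporting the mean of both directions avoids rewarding a model that retrieves well in only one direction, as the toy matrix shows: the second image retrieves the wrong caption even though every caption retrieves the right image.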