HiMo-CLIP: Modeling Semantic Hierarchy and Monotonicity in Vision-Language Alignment

Ruijia Wu, Ping Chen, Fei Shen, Shaoan Zhao, Qiang Hui, Huanlin Gao, Ting Lu, Zhaoxiang Liu, Fang Zhao, Kai Wang, Shiguo Lian

TL;DR

This paper tackles the limitations of CLIP in handling long-form, compositional language by introducing HiMo-CLIP, a representation-level framework that enforces semantic hierarchy and monotonicity without altering encoders. It introduces HiDe, an in-batch PCA-based decomposition that extracts latent semantic components, and MoLo, a dual-branch contrastive loss aligning images with both full-text and semantic components. The method achieves state-of-the-art performance on long-form and compositional benchmarks while preserving strong short-text retrieval, and it demonstrates robust monotonic alignment as textual detail increases. Overall, HiMo-CLIP advances vision-language models toward structured, hierarchical, and monotonic cross-modal understanding in a scalable, encoder-agnostic manner.

Abstract

Contrastive vision-language models like CLIP have achieved impressive results in image-text retrieval by aligning image and text representations in a shared embedding space. However, these models often treat text as flat sequences, limiting their ability to handle complex, compositional, and long-form descriptions. In particular, they fail to capture two essential properties of language: semantic hierarchy, which reflects the multi-level compositional structure of text, and semantic monotonicity, where richer descriptions should result in stronger alignment with visual content. To address these limitations, we propose HiMo-CLIP, a representation-level framework that enhances CLIP-style models without modifying the encoder architecture. HiMo-CLIP introduces two key components: a hierarchical decomposition (HiDe) module that extracts latent semantic components from long-form text via in-batch PCA, enabling flexible, batch-aware alignment across different semantic granularities, and a monotonicity-aware contrastive loss (MoLo) that jointly aligns global and component-level representations, encouraging the model to internalize semantic ordering and alignment strength as a function of textual completeness. These components work in concert to produce structured, cognitively aligned cross-modal representations. Experiments on multiple image-text retrieval benchmarks show that HiMo-CLIP consistently outperforms strong baselines, particularly under long or compositional descriptions. The code is available at https://github.com/UnicomAI/HiMo-CLIP.
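
The two components described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the number of retained components `k`, the branch weight `weight`, the temperature, and the choice of a symmetric InfoNCE loss for both branches are all assumptions made for illustration. The sketch projects each text embedding onto the top-k principal directions of the current batch (an in-batch PCA in the spirit of HiDe) and adds a component-level contrastive term to the usual global one (a dual-branch objective in the spirit of MoLo).

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Unit-normalize rows; eps guards against zero vectors.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def in_batch_pca_components(text_emb, k):
    """Project each text embedding onto the top-k principal directions
    of the batch (hypothetical helper, assumed form of the decomposition)."""
    centered = text_emb - text_emb.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions of the centered batch.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]                        # (k, d)
    return centered @ basis.T @ basis     # (n, d), lies in the top-k subspace

def info_nce(img, txt, temperature=0.07):
    """Symmetric contrastive loss; matching pairs sit on the diagonal."""
    logits = l2_normalize(img) @ l2_normalize(txt).T / temperature

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)          # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

def dual_branch_loss(img_emb, text_emb, k=4, weight=1.0):
    # Global branch: image vs. full-text embedding.
    # Component branch: image vs. in-batch PCA component of the text.
    comp = in_batch_pca_components(text_emb, k)
    return info_nce(img_emb, text_emb) + weight * info_nce(img_emb, comp)
```

Because the principal directions are computed from the batch itself, the same caption can yield a different component when it appears alongside different captions, which is the batch-aware behavior the abstract refers to.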

Paper Structure

This paper contains 18 sections, 22 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: (a) Text descriptions of an image often grow in semantic richness, from short to long, by adding more visual details. (b) However, existing models, even those tailored for long-form text, often fail to preserve semantic monotonicity, overlooking this essential principle when scaling to richer descriptions. In contrast, HiMo-CLIP maintains alignment consistency across text granularities, effectively addressing this overlooked yet critical challenge. (Note: FineLIP’s similarity exceeds 1 due to its customized test-time scaling.)
  • Figure 2: HiMo-CLIP Framework. Our method enhances CLIP with two encoder-agnostic modules: (1) HiDe performs in-batch PCA to extract the most discriminative semantic components from each text, adapting to batch context and revealing dynamic semantic hierarchy. For instance, the same text (red dashed box) yields different components in different batches. (2) MoLo enforces a dual alignment objective, aligning images with both full-text embeddings (Global Alignment) and their primary semantic components (Component-Level Alignment), promoting semantic monotonicity.
  • Figure 3: Semantic monotonicity on HiMo-Docci. Each image is paired with five increasingly complete subtexts (HiMo@5). HiMo-CLIP shows consistent score increases, unlike other methods. All scores are sample-normalized for fair comparison.
  • Figure 4: Semantic monotonicity across HiMo@2 and HiMo@3 tasks. (a) HiMo@2: Each image is paired with two text segments of increasing completeness, evaluating whether richer descriptions lead to stronger alignment. (b) HiMo@3: The image is matched with three hierarchical segments (short, medium, long), following the setup in Fig. 1, where alignment should grow with textual detail. Green/red indicates correct/incorrect semantic monotonicity. FineLIP scores may exceed 1 due to its official score fusion strategy.
  • Figure 5: Semantic monotonicity results on extended HiMo@K tasks. (a) HiMo@4: Each image is paired with 4 subtexts of increasing semantic detail for finer-grained monotonicity testing. (b) HiMo@7: Each image is paired with 7 increasingly rich captions, enabling stricter evaluation of hierarchical alignment than Fig. 4. Green/red markers indicate correct/incorrect semantic orderings. FineLIP scores may exceed 1 due to its fusion strategy.
  • ...and 2 more figures
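
The HiMo@K captions above describe pairing each image with K increasingly complete subtexts and checking whether the similarity score rises with each step. A minimal sketch of such a check follows. The function names and the strict-increase criterion are assumptions for illustration; the paper's exact metric may differ.

```python
import numpy as np

def is_monotonic(scores):
    """True if similarity strictly increases across the K subtexts of one image."""
    s = np.asarray(scores, dtype=float)
    return bool(np.all(np.diff(s) > 0))

def monotonic_rate(score_matrix):
    """Fraction of images whose K scores are strictly increasing.

    score_matrix has shape (num_images, K); column j holds the image-text
    similarity for the j-th (increasingly complete) subtext.
    """
    m = np.asarray(score_matrix, dtype=float)
    return float(np.mean(np.all(np.diff(m, axis=1) > 0, axis=1)))

# Hypothetical scores for two images under a HiMo@3-style setup.
scores = [
    [0.21, 0.25, 0.30],   # rises at every step: monotonic
    [0.28, 0.26, 0.31],   # dips at the second subtext: not monotonic
]
print(monotonic_rate(scores))  # 0.5
```

Any positive per-sample rescaling of the scores leaves this check unchanged, since it depends only on their ordering.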