HiMo-CLIP: Modeling Semantic Hierarchy and Monotonicity in Vision-Language Alignment
Ruijia Wu, Ping Chen, Fei Shen, Shaoan Zhao, Qiang Hui, Huanlin Gao, Ting Lu, Zhaoxiang Liu, Fang Zhao, Kai Wang, Shiguo Lian
TL;DR
This paper tackles the limitations of CLIP in handling long-form, compositional language by introducing HiMo-CLIP, a representation-level framework that enforces semantic hierarchy and monotonicity without altering encoders. It introduces HiDe, an in-batch PCA-based decomposition that extracts latent semantic components, and MoLo, a dual-branch contrastive loss aligning images with both full-text and semantic components. The method achieves state-of-the-art performance on long-form and compositional benchmarks while preserving strong short-text retrieval, and it demonstrates robust monotonic alignment as textual detail increases. Overall, HiMo-CLIP advances vision-language models toward structured, hierarchical, and monotonic cross-modal understanding in a scalable, encoder-agnostic manner.
Abstract
Contrastive vision-language models like CLIP have achieved impressive results in image-text retrieval by aligning image and text representations in a shared embedding space. However, these models often treat text as flat sequences, limiting their ability to handle complex, compositional, and long-form descriptions. In particular, they fail to capture two essential properties of language: semantic hierarchy, which reflects the multi-level compositional structure of text, and semantic monotonicity, where richer descriptions should result in stronger alignment with visual content. To address these limitations, we propose HiMo-CLIP, a representation-level framework that enhances CLIP-style models without modifying the encoder architecture. HiMo-CLIP introduces two key components: a hierarchical decomposition (HiDe) module that extracts latent semantic components from long-form text via in-batch PCA, enabling flexible, batch-aware alignment across different semantic granularities, and a monotonicity-aware contrastive loss (MoLo) that jointly aligns global and component-level representations, encouraging the model to internalize semantic ordering and alignment strength as a function of textual completeness. These components work in concert to produce structured, cognitively aligned cross-modal representations. Experiments on multiple image-text retrieval benchmarks show that HiMo-CLIP consistently outperforms strong baselines, particularly under long or compositional descriptions. The code is available at https://github.com/UnicomAI/HiMo-CLIP.
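To make the two components concrete, the following is a minimal NumPy sketch of the general idea, not the authors' implementation. It assumes that the in-batch PCA step projects centered text embeddings onto the top-k principal directions of the batch, and that the dual-branch loss is a weighted sum of two symmetric InfoNCE terms. The function names, the number of components `k`, the weight `lam`, and the temperature are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def in_batch_pca_components(text_emb, k):
    """Project centered text embeddings onto the top-k principal
    directions estimated from the current batch (assumed form of HiDe)."""
    centered = text_emb - text_emb.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions of the batch.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]                        # shape (k, D)
    return centered @ basis.T @ basis     # shape (B, D), rank at most k


def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss; matching pairs share a row index."""
    logits = l2_normalize(img_emb) @ l2_normalize(txt_emb).T / temperature

    def cross_entropy_diag(l):
        l = l - l.max(axis=1, keepdims=True)   # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))


def himo_style_loss(img_emb, text_emb, k=4, lam=1.0, temperature=0.07):
    """Global branch (image vs. full text) plus component branch
    (image vs. PCA-derived text components); lam is an assumed weight."""
    comp_emb = in_batch_pca_components(text_emb, k)
    global_term = info_nce(img_emb, text_emb, temperature)
    component_term = info_nce(img_emb, comp_emb, temperature)
    return global_term + lam * component_term


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(8, 16))   # stand-ins for encoder outputs
    texts = rng.normal(size=(8, 16))
    print(himo_style_loss(images, texts, k=4))
```

Because the principal directions are recomputed for every batch, the component branch in this sketch depends on which captions happen to appear together, which is one way to read the "batch-aware" description above. Both branches operate only on encoder outputs, so the encoders themselves are left unchanged.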
