Reliable Cross-modal Alignment via Prototype Iterative Construction

Xiang Ma, Litian Xu, Lexin Fang, Caiming Zhang, Lizhen Cui

TL;DR

This work tackles cross-modal alignment by addressing non-semantic style information that can bias semantic matching. It introduces PICO, a framework that adaptively weighs embedding interactions using semantic probabilities, and constructs reliable style prototypes via an iterative, performance-feedback mechanism. The approach combines pseudo-semantic probability estimation, pseudo-style prototypes, and a dynamic prototype update rule to suppress style-dominated features during interaction, yielding substantial improvements on Flickr30K and MS-COCO across multiple backbones and even VLP models. The results demonstrate better semantic alignment, robustness to style variations, and improved generalization in cross-modal retrieval tasks.

Abstract

Cross-modal alignment is an important multi-modal task, aiming to bridge the semantic gap between different modalities. The most reliable foundation for achieving this objective lies in the semantic consistency between matched pairs. Conventional methods implicitly assume embeddings contain solely semantic information, ignoring the impact of non-semantic information during alignment, which inevitably leads to information bias or even loss. This non-semantic information primarily manifests as stylistic variations in the data, which we formally define as style information. An intuitive approach is to separate style from semantics, aligning only the semantic information. However, most existing methods distinguish them based on feature columns, which cannot represent the complex coupling relationship between semantic and style information. In this paper, we propose PICO, a novel framework for suppressing style interference during embedding interaction. Specifically, we quantify the probability of each feature column representing semantic information, and regard it as the weight during the embedding interaction. To ensure the reliability of the semantic probability, we propose a prototype iterative construction method. The key operation of this method is a performance feedback-based weighting function, and we have theoretically proven that the function can assign higher weight to prototypes that bring higher performance improvements. Extensive experiments on various benchmarks and model backbones demonstrate the superiority of PICO, outperforming state-of-the-art methods by 5.2%-14.1%.
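To make the idea of weighting feature columns during embedding interaction concrete, here is a minimal sketch and not the authors' implementation. It assumes the interaction is a cosine similarity between region and word embeddings, with each word matched to its best region. The names `weighted_cosine`, `weighted_alignment_score`, and `sem_prob` are hypothetical, and the semantic probabilities are supplied by hand rather than estimated from prototypes.

```python
import math

def weighted_cosine(u, v, sem_prob):
    """Cosine similarity where column k contributes in proportion to sem_prob[k].

    sem_prob[k] in [0, 1] is an assumed, externally supplied probability
    that feature column k carries semantic (not style) information.
    """
    num = sum(p * a * b for a, b, p in zip(u, v, sem_prob))
    norm_u = math.sqrt(sum(p * a * a for a, p in zip(u, sem_prob)))
    norm_v = math.sqrt(sum(p * b * b for b, p in zip(v, sem_prob)))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return num / (norm_u * norm_v)

def weighted_alignment_score(regions, words, sem_prob):
    """Match each word to its best region, then average (an assumed pooling choice)."""
    best = [max(weighted_cosine(r, w, sem_prob) for r in regions) for w in words]
    return sum(best) / len(best)

# Toy pair: the first two columns agree, the third (treated as style) conflicts.
regions = [[1.0, 0.0, 5.0]]
words = [[1.0, 0.0, -5.0]]
unweighted = weighted_alignment_score(regions, words, [1.0, 1.0, 1.0])
weighted = weighted_alignment_score(regions, words, [1.0, 1.0, 0.0])
```

In this toy pair, the unweighted score is dominated by the conflicting third column, while setting that column's weight to zero leaves only the agreeing columns to decide the match.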

Paper Structure

This paper contains 16 sections, 12 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Images (or texts) with different expression styles can correspond to the same text (or image), indicating embeddings contain both semantic and non-semantic information.
  • Figure 2: Overview of PICO. (a) Traditional fine-grained cross-modal alignment. (b) The weighted fine-grained cross-modal alignment, which weights feature columns during embedding interaction. (c) Semantic probability calculation of PICO. First, statistical analysis of feature column interactions yields pseudo-semantic and pseudo-style probabilities. Next, style prototypes are extracted and refined through iterative construction. Finally, comparing features with their prototypes provides style and semantic probabilities, where semantic probability weights suppress style-dominated features during embedding interaction.
  • Figure 3: Prototype iterative construction. During epoch $0$ to $j_0$, visual-textual embedding alignment is initialized. At epoch $j_0$, pseudo-style prototypes are constructed by weighting feature columns with pseudo-style probabilities, serving as initial style prototypes. From epoch $j_1$ onward, these prototypes are iteratively updated via performance feedback-based weighting.
  • Figure 4: Visualization of the correlation score distribution with / without PICO's weighting process during embedding interaction. After weighting, correlation scores lie closer to both ends (0 or 1), simplifying match assessment.
  • Figure 5: The visualization of patches corresponding to words with similar meanings. The red boxes indicate the differences between patches selected by the two methods.
  • ...and 1 more figure
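The prototype update described in the Figure 3 caption can be illustrated with a small sketch under stated assumptions. The paper's actual performance feedback-based weighting function is not given in this excerpt, so a softmax over per-prototype performance gains stands in for it here. It shares only the stated property that a larger gain yields a larger weight. The names `feedback_weights`, `update_prototype`, `gains`, and `temperature` are hypothetical.

```python
import math

def feedback_weights(gains, temperature=1.0):
    """Softmax over performance gains (an assumed stand-in, not the paper's function).

    It is monotonic: a prototype with a larger gain receives a larger weight.
    """
    top = max(gains)  # subtract the max for numerical stability
    exps = [math.exp((g - top) / temperature) for g in gains]
    total = sum(exps)
    return [e / total for e in exps]

def update_prototype(candidates, gains, temperature=1.0):
    """Combine candidate prototypes into one, weighted by their feedback weights."""
    weights = feedback_weights(gains, temperature)
    dim = len(candidates[0])
    return [sum(w * c[k] for w, c in zip(weights, candidates)) for k in range(dim)]

# Two candidate style prototypes; the second brought the larger validation gain.
candidates = [[0.0, 0.0], [1.0, 1.0]]
weights = feedback_weights([0.0, 1.0])
prototype = update_prototype(candidates, [0.0, 1.0])
```

Here the combined prototype lies closer to the second candidate because it brought the larger gain. With equal gains, the update reduces to a plain average.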