Cross Modal Fine-Grained Alignment via Granularity-Aware and Region-Uncertain Modeling

Jiale Liu, Haoming Zhou, Yishu Liu, Bingzhi Chen, Yuncheng Jiang

TL;DR

The paper tackles fine-grained image-text alignment by identifying two key limitations in prior work: the lack of robust intra-modal mechanisms for assessing the significance of visual and textual tokens, and the absence of region-level uncertainty modeling. It introduces GRM, a unified framework that employs Significance-aware and Granularity-aware Adapters, Region Prompting, and a Mixture-of-Gaussians representation to capture fine-grained uncertainty at the region level. Through multi-level, bidirectional alignment and semantic-consistency constraints, GRM achieves state-of-the-art results on Flickr30K and MS-COCO across multiple backbones, while improving robustness and interpretability. The approach shows potential for downstream tasks that require precise cross-modal grounding and detailed alignment of localized visual regions with textual tokens.

Abstract

Fine-grained image-text alignment is a pivotal challenge in multimodal learning, underpinning key applications such as visual question answering, image captioning, and vision-language navigation. Unlike global alignment, fine-grained alignment requires precise correspondence between localized visual regions and textual tokens, often hindered by noisy attention mechanisms and oversimplified modeling of cross-modal relationships. In this work, we identify two fundamental limitations of existing approaches: the lack of robust intra-modal mechanisms to assess the significance of visual and textual tokens, leading to poor generalization in complex scenes; and the absence of fine-grained uncertainty modeling, which fails to capture the one-to-many and many-to-one nature of region-word correspondences. To address these issues, we propose a unified approach that incorporates significance-aware and granularity-aware modeling and region-level uncertainty modeling. Our method leverages modality-specific biases to identify salient features without relying on brittle cross-modal attention, and represents region features as a mixture of Gaussian distributions to capture fine-grained uncertainty. Extensive experiments on Flickr30K and MS-COCO demonstrate that our approach achieves state-of-the-art performance across various backbone architectures, significantly enhancing the robustness and interpretability of fine-grained image-text alignment.
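To make the "mixture of Gaussian distributions" idea concrete, the following is a minimal sketch of how a single region feature could be represented and sampled. It is not the paper's implementation: the number of components, the diagonal covariance, the softmax over mixing logits, and all function names (`softmax`, `mixture_mean`, `sample_region`) are assumptions chosen for illustration.

```python
import math
import random


def softmax(logits):
    """Turn unnormalized mixing logits into weights that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_mean(weights, means):
    """Expected region embedding: sum_k pi_k * mu_k, per dimension."""
    dim = len(means[0])
    return [sum(w * mu[d] for w, mu in zip(weights, means)) for d in range(dim)]


def sample_region(weights, means, stds, rng):
    """Draw one embedding: pick a component, then z = mu_k + sigma_k * eps."""
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return [mu + sd * rng.gauss(0.0, 1.0) for mu, sd in zip(means[k], stds[k])]


# Hypothetical toy region with two equally weighted 2-D components.
rng = random.Random(0)
weights = softmax([0.0, 0.0])
means = [[1.0, 0.0], [-1.0, 0.0]]
stds = [[0.1, 0.1], [0.1, 0.1]]

center = mixture_mean(weights, means)
samples = [sample_region(weights, means, stds, rng) for _ in range(5)]
```

Under these assumptions, a region whose components sit far apart can match several different words (one-to-many), whereas a single point embedding would have to commit to one location.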

Paper Structure

This paper contains 24 sections, 13 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: The Overview of the Proposed GRM. The visual encoder $f_v(\cdot)$ and text encoder $f_t(\cdot)$ independently encode input image and text instances to obtain their respective representations, $\mathbf{V}$ and $\mathbf{T}$. These embeddings are then passed through two structurally identical but functionally distinct adapters: the Significance-aware Adapter and the Granularity-aware Adapter, which learn modality-specific distribution biases. Subsequently, the image embeddings undergo region-level prompt learning and uncertainty modeling to capture fine-grained semantic variations. Finally, a multi-level alignment strategy is applied to effectively align the cross-modal knowledge between images and texts.
  • Figure 2: The detailed architecture of the Significance-aware Adapter and the Granularity-aware Adapter.
  • Figure 3: The computation process of bidirectional image-text alignment.
  • Figure 4: Comparison of hyperparameter experimental performance. The experiments are conducted on the Flickr30K dataset. (a) Parameter study on the number of region prompts $\mathbf{P}$. The blue bar chart shows results with the visual backbone ViT-base-224, while the yellow line chart shows results with Swin-base-224. (b) Parameter study on different combinations of weights $a$, $b$ and $c$ in Equation \ref{equ:Lcon}. Since $a+b+c=1$, only $a$ and $b$ need to be specified. Six combinations are tested: when $a=0.2$ (blue blocks), $b=\{0.2,0.4,0.6\}$; when $a=0.4$ (yellow blocks), $b=\{0.2,0.4\}$; and when $a=0.6$ (green blocks), $b=0.2$.
  • Figure 5: The visualization of fine-grained patch-word alignment with each linguistic word.
  • ...and 1 more figure
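The Figure 4(b) caption specifies only $a$ and $b$ and leaves $c$ implied by $a+b+c=1$. The short sketch below derives the full triples for the six tested combinations. The $(a, b)$ pairs are copied from the caption; the function name and the choice to work in integer tenths (to avoid floating-point drift in the sum) are assumptions made for illustration.

```python
# (a, b) pairs in tenths, copied from the Figure 4(b) caption:
# a=0.2 -> b in {0.2, 0.4, 0.6}; a=0.4 -> b in {0.2, 0.4}; a=0.6 -> b=0.2.
AB_TENTHS = {2: (2, 4, 6), 4: (2, 4), 6: (2,)}


def derive_combinations(ab_tenths=AB_TENTHS, total=10):
    """Return (a, b, c) triples, with c fixed by a + b + c = 1."""
    combos = []
    for a, bs in ab_tenths.items():
        for b in bs:
            c = total - a - b  # remaining weight, still in tenths
            combos.append((a / 10, b / 10, c / 10))
    return combos


combos = derive_combinations()
for a, b, c in combos:
    print(f"a={a:.1f}, b={b:.1f}, c={c:.1f}")
```

Derived this way, every $c$ also falls in $\{0.2, 0.4, 0.6\}$, so the six settings are exactly the triples over that set that sum to 1.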