StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues

Zanxi Ruan; Qiuyu Kong; Songqun Gao; Yiming Wang; Marco Cristani

TL;DR

StructXLIP introduces a structure-centric fine-tuning paradigm for vision–language models: it extracts edge-based visual structure from images and filters captions to emphasize geometry and layout. On top of standard image–text alignment, it adds three losses that align edge maps with structure-centric text, pair local edge regions with textual chunks, and regularize consistency between edge maps and color images, all within a two-stage extraction–alignment pipeline. The approach is analyzed through an information-theoretic lens and shown to improve cross-modal retrieval across general and specialized domains, while remaining plug-and-play with existing CLIP-based fine-tuning frameworks. The results indicate stronger, more semantically stable alignment for long, detail-rich captions and suggest a path toward broader applications, including possible training from scratch.
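The caption-filtering step can be illustrated with a minimal sketch. It assumes a small hand-written appearance vocabulary (`APPEARANCE_WORDS`) and a hypothetical helper `filter_caption`; neither is taken from the paper, whose actual lexicon and filtering rules are not reproduced here.

```python
import re

# Hypothetical appearance vocabulary (an assumption for illustration only).
APPEARANCE_WORDS = {
    "red", "blue", "green", "yellow", "black", "white",
    "brown", "shiny", "glossy", "dark", "bright",
}

def filter_caption(caption, vocabulary=APPEARANCE_WORDS):
    """Drop appearance-related words and keep the remaining tokens in order."""
    # Split into word tokens and single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", caption)
    kept = [t for t in tokens if t.lower() not in vocabulary]
    text = " ".join(kept)
    # Re-attach common punctuation to the preceding word.
    return re.sub(r"\s+([,.;:!?])", r"\1", text)

print(filter_caption("A red cube sits on top of a shiny blue cylinder."))
# prints: A cube sits on top of a cylinder.
```

The filtered caption keeps the objects and their spatial relation while discarding color and material words, which is the sense in which a caption becomes "structure-centric" in this sketch.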

Abstract

Edge-based representations are fundamental cues for visual understanding, a principle rooted in early vision research and still central today. We extend this principle to vision-language alignment, showing that isolating and aligning structural cues across modalities can greatly benefit fine-tuning on long, detail-rich captions, with a specific focus on improving cross-modal retrieval. We introduce StructXLIP, a fine-tuning alignment paradigm that extracts edge maps (e.g., Canny), treating them as proxies for the visual structure of an image, and filters the corresponding captions to emphasize structural cues, making them "structure-centric". Fine-tuning augments the standard alignment loss with three structure-centric losses: (i) aligning edge maps with structural text, (ii) matching local edge regions to textual chunks, and (iii) connecting edge maps to color images to prevent representation drift. From a theoretical standpoint, while standard CLIP maximizes the mutual information between visual and textual embeddings, StructXLIP additionally maximizes the mutual information between multimodal structural representations. This auxiliary optimization is intrinsically harder, guiding the model toward more robust and semantically stable minima, enhancing vision-language alignment. Beyond outperforming current competitors on cross-modal retrieval in both general and specialized domains, our method serves as a general boosting recipe that can be integrated into future approaches in a plug-and-play manner. Code and pretrained models are publicly available at: https://github.com/intelligolabs/StructXLIP.
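As a rough illustration of how the objectives combine, the sketch below implements a symmetric InfoNCE loss in plain Python and sums three of the four terms. Several choices here are assumptions rather than details from the paper: the function names, the temperature of 0.07, the weights `w_struct` and `w_cons`, and the use of the same contrastive form for the consistency term. The local chunk-level loss is omitted.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def info_nce(xs, ys, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings xs[i] <-> ys[i]."""
    xs = [normalize(x) for x in xs]
    ys = [normalize(y) for y in ys]
    n = len(xs)
    # Cosine-similarity logits scaled by the temperature.
    logits = [[dot(x, y) / temperature for y in ys] for x in xs]

    def cross_entropy_rows(mat):
        # Mean cross-entropy with the diagonal entry as the positive pair.
        total = 0.0
        for i, row in enumerate(mat):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(v - m) for v in row))
            total += log_z - row[i]
        return total / n

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(transposed))

def structxlip_style_loss(img, txt, edge, struct_txt, w_struct=1.0, w_cons=1.0):
    """Standard alignment plus global structural and consistency terms.

    Weights are hypothetical; the local chunk-level loss is not included.
    """
    return (info_nce(img, txt)
            + w_struct * info_nce(edge, struct_txt)
            + w_cons * info_nce(img, edge))
```

In this sketch, `img` and `edge` would come from the same image encoder applied to the color image and its edge map, and `txt` and `struct_txt` from the text encoder applied to the original and filtered captions.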

Paper Structure (29 sections, 4 theorems, 16 equations, 16 figures, 10 tables)

Introduction
Related Works
StructXLIP
Structure-centric multimodal extraction
Structure-centric multimodal alignment
Theoretical view
Experiments
Main comparisons
Ablation studies
Qualitative analysis
Conclusion
Dataset Details
Condensed INSECT Dataset Construction
Lexicon Filtering
Appearance Vocabulary Generation
...and 14 more sections

Key Result

Lemma 1

Let $I$ and $T$ denote the random variables associated with images and texts, and let $I' = \mathcal{E}(I)$ and $T' = \mathcal{F}(T)$ be their structure-centric counterparts, where $\mathcal{E}(\cdot)$ and $\mathcal{F}(\cdot)$ are deterministic maps that remove appearance-related information, and $I_{\mathrm{MI}}(\cdot,\cdot)$ is the mutual information operator.
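Because $\mathcal{E}$ and $\mathcal{F}$ are deterministic maps, one ordering that follows directly from the data processing inequality is sketched below. This is a standard information-theoretic fact stated in the lemma's notation; it is an assumption that it matches the ordering the lemma establishes.

```latex
\begin{align}
I_{\mathrm{MI}}(I'; T') &= I_{\mathrm{MI}}\big(\mathcal{E}(I); \mathcal{F}(T)\big) \\
&\le I_{\mathrm{MI}}\big(I; \mathcal{F}(T)\big) \\
&\le I_{\mathrm{MI}}(I; T).
\end{align}
```

Each inequality applies a deterministic function to one argument, which cannot increase mutual information with the other argument.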

Figures (16)

Figure 1: StructXLIP performs fine-tuning by augmenting the standard image-text alignment with multimodal structural cues: edge maps and structure-centric captions. StructXLIP consistently improves downstream vision–language retrieval at inference.
Figure 1: Illustration of SKETCHY dataset samples across different edge-map representations. For each example, we show the original RGB image together with its caption, followed by Canny, LoG, HED, LAD, and P2S edge maps, as well as the filtered caption $T'$ that preserves only structure-centric information.
Figure 2: Examples of Visual (left) and Textual (right) extraction.
Figure 2: Illustration of INSECT dataset samples across different edge-map representations.
Figure 3: (i) Overview of StructXLIP fine-tuning, which operates in two stages. The first stage, structure-centric multimodal extraction, produces structural views by generating edge maps from the original images with an edge detector and by applying lexicon filtering to the original captions to remove appearance-related terms. The second stage, structure-centric multimodal alignment, fine-tunes the encoders $f_{\mathtt{img}}$ and $f_{\mathtt{txt}}$ by combining the original image-text alignment $\mathcal{L}_{I,T}$ with three newly introduced structure-centric objectives: the structure-centric image-text alignment loss $\mathcal{L}_{I',T'}$ enforces global structural alignment, the local structure-centric image-text alignment loss $\mathcal{L}_{I',T'}^{local}$ captures local compositional semantics, and the consistency regularization loss $\mathcal{L}_{I,I'}$ aligns raw images and edge maps to prevent representation drift. At inference, only color images ($I$) and captions ($T$) are fed to the fine-tuned $f_{\mathtt{img}}$ and $f_{\mathtt{txt}}$; neither edge extraction nor lexicon filtering is required. (ii) Zoom-in on the local structure-centric image-text alignment loss $\mathcal{L}_{I',T'}^{local}$.
...and 11 more figures

Theorems & Definitions (4)

Lemma 1: Information ordering for structure-centric views
Lemma 2: InfoNCE lower bounds for the two objectives
Lemma 3: Directional compatibility of gradients
Theorem 1: Effect of the structure-centric auxiliary losses
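
For orientation on Lemma 2, the commonly cited InfoNCE bound relates a contrastive loss over $N$ candidates to mutual information. The form below is a sketch in this page's notation; the batch size $N$ and the exact statement in the paper are assumptions.

```latex
\begin{align}
I_{\mathrm{MI}}(I; T) &\ge \log N - \mathcal{L}_{I,T}, \\
I_{\mathrm{MI}}(I'; T') &\ge \log N - \mathcal{L}_{I',T'}.
\end{align}
```

Under this reading, lowering each contrastive loss raises a lower bound on the corresponding mutual information, which is the link between the alignment losses and the information-theoretic view.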
