Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions

Zeyu Han, Fangrui Zhu, Qianru Lao, Huaizu Jiang

TL;DR

This work tackles zero-shot referring expression comprehension by explicitly modeling the relational structure shared by an image and its caption as subject–predicate–object triplets. It proposes a two-stage pipeline: (1) construct triplets from both modalities (ChatGPT-assisted parsing for text, exhaustive pairing of detected entities for images), and (2) ground at the triplet level via structural similarity, then propagate the scores to instance-level grounding. The pipeline is complemented by triplet-based contrastive fine-tuning of vision-language models on relational data. The approach improves on the previous zero-shot state of the art on RefCOCO/+/g by up to 19.5% and reaches accuracy comparable to the fully supervised model on Who's Waldo, indicating improved relational understanding.

Abstract

Zero-shot referring expression comprehension aims at localizing bounding boxes in an image corresponding to provided textual prompts, which requires: (i) a fine-grained disentanglement of the complex visual scene and textual context, and (ii) a capacity to understand relationships among the disentangled entities. Unfortunately, existing large vision-language alignment (VLA) models, e.g., CLIP, struggle with both aspects and so cannot be directly used for this task. To mitigate this gap, we leverage large foundation models to disentangle both images and texts into triplets in the format of (subject, predicate, object). Grounding is then accomplished by calculating the structural similarity matrix between visual and textual triplets with a VLA model and subsequently propagating it to an instance-level similarity matrix. Furthermore, to equip VLA models with the ability to understand relationships, we design a triplet-matching objective to fine-tune the VLA models on a curated collection of datasets containing abundant entity relationships. Experiments demonstrate that our method improves visual grounding performance by up to 19.5% over the SOTA zero-shot model on RefCOCO/+/g. On the more challenging Who's Waldo dataset, our zero-shot approach achieves accuracy comparable to the fully supervised model. Code is available at https://github.com/Show-han/Zeroshot_REC.

Paper Structure

This paper contains 15 sections, 7 equations, 10 figures, and 6 tables.

Figures (10)

  • Figure 1: Illustration of how we disambiguate visual entities based on their interactions with other entities. Matching entities and relationships in the image and caption are shown in the same color.
  • Figure 2: Illustration of the triplet-level structural similarity. Visual and textual triplets are encoded by the image encoder and the text encoder, respectively. The structural similarity is then calculated as the sum of the cosine similarities between the subjects, the predicates, and the objects.
  • Figure 3: Illustration of leveraging ChatGPT's powerful in-context learning capability to parse a caption into triplets.
  • Figure 4: Illustration of propagating the similarity scores from grounded triplets to the instance level. Aggregating the similarity scores from multiple grounded triplets helps find the instance-level correspondences more accurately. For instance, in the lower part, the referring expression "a man" and the blue bounding box appear in two different triplets, acting as the subject and object, respectively. Such structural similarity provides more useful cues to improve the instance-level grounding. (Best viewed in color.)
  • Figure 5: Zero-shot visual grounding results. The left two columns show results from RefCOCO, where our predictions are in green boxes and distractor objects are in red boxes. The rightmost column shows results from Who's Waldo, where predicted annotation links are in the same color. Arrows represent relationships between visual objects, and the text on the images shows the parsed triplets.
  • ...and 5 more figures
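
The captions of Figures 2 and 4 describe two computations: a triplet-level structural similarity (the sum of cosine similarities over subject, predicate, and object) and its propagation to instance-level scores. The sketch below illustrates both on toy data. It is not the authors' implementation. The dictionary layout, the function names, the hand-written 2-D embeddings (standing in for VLA encoder outputs), and in particular the aggregation rule (each textual triplet adds the score of its best-matching visual triplet to its subject and object pairs) are all assumptions made for illustration.

```python
import math

ROLES = ("subject", "predicate", "object")


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def triplet_similarity(visual, textual):
    """Structural similarity as in the Figure 2 caption: sum over the three roles."""
    return sum(cosine(visual["emb"][r], textual["emb"][r]) for r in ROLES)


def similarity_matrix(visual_triplets, textual_triplets):
    """Rows index visual triplets, columns index textual triplets."""
    return [[triplet_similarity(v, t) for t in textual_triplets]
            for v in visual_triplets]


def propagate(visual_triplets, textual_triplets, num_boxes, num_phrases):
    """Aggregate triplet scores into a box-by-phrase matrix.

    ASSUMPTION: each textual triplet is grounded to its single best visual
    triplet, and that score is added to the (subject box, subject phrase) and
    (object box, object phrase) entries. The paper's exact rule may differ.
    """
    sim = similarity_matrix(visual_triplets, textual_triplets)
    inst = [[0.0] * num_phrases for _ in range(num_boxes)]
    for j, t in enumerate(textual_triplets):
        best = max(range(len(visual_triplets)), key=lambda i: sim[i][j])
        v = visual_triplets[best]
        inst[v["subj_box"]][t["subj_phrase"]] += sim[best][j]
        inst[v["obj_box"]][t["obj_phrase"]] += sim[best][j]
    return inst


# Toy example: two boxes, paired exhaustively in both orders (hypothetical data).
visual_triplets = [
    {"subj_box": 0, "obj_box": 1,
     "emb": {"subject": [1.0, 0.0], "predicate": [1.0, 0.0], "object": [0.0, 1.0]}},
    {"subj_box": 1, "obj_box": 0,
     "emb": {"subject": [0.0, 1.0], "predicate": [1.0, 0.0], "object": [1.0, 0.0]}},
]
# One parsed caption triplet; phrase 0 is the subject, phrase 1 is the object.
textual_triplets = [
    {"subj_phrase": 0, "obj_phrase": 1,
     "emb": {"subject": [1.0, 0.0], "predicate": [1.0, 0.0], "object": [0.0, 1.0]}},
]

instance_scores = propagate(visual_triplets, textual_triplets,
                            num_boxes=2, num_phrases=2)
# For each phrase, pick the box with the highest aggregated score.
predicted_box = [max(range(2), key=lambda b: instance_scores[b][p])
                 for p in range(2)]
```

In this toy setup the first visual triplet agrees with the caption triplet on all three roles, so phrase 0 is assigned to box 0 and phrase 1 to box 1. With several textual triplets mentioning the same phrase, the scores accumulate in the same matrix entry, which is the kind of aggregation the Figure 4 caption refers to.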