Barking Up The Syntactic Tree: Enhancing VLM Training with Syntactic Losses

Jiayun Luo, Mir Rayat Imtiaz Hossain, Boyang Li, Leonid Sigal

TL;DR

This work addresses the limited utility of coarse image-caption supervision in Vision-Language Models by injecting hierarchical textual structure into training. The HIerarchically STructured (HIST) framework decomposes captions into Subjects, Phrases, and Composite Phrases and enforces entailment with the image via three losses: Subject Loss, Phrase Loss, and Composition Loss, without extra annotations. Empirically, HIST yields substantial gains in visual grounding and complex referring segmentation, with collateral improvements in image-text retrieval and VQA, and generalizes to LM-centric architectures like TinyLLaVA. Overall, HIST demonstrates that leveraging hierarchical syntax in text provides robust, model-agnostic benefits for spatial localization and multimodal understanding.

Abstract

Vision-Language Models (VLMs) implicitly learn to associate image regions with words from large-scale training data, demonstrating an emergent capability for grounding concepts without dense annotations [14,18,51]. However, the coarse-grained supervision from image-caption pairs is often insufficient to resolve ambiguities in object-concept correspondence, even with enormous data volume. Rich semantic and syntactic structures within the text modality have been overlooked as sources of supervision. Starting from contrastive architectures (BLIP and ALBEF) that show strong intrinsic grounding abilities, we propose HIerarchically STructured Learning (HIST). HIST enhances spatial vision-language alignment without using additional human annotations, by hierarchically decomposing captions into their constituent Subjects, Phrases, and Composite Phrases, and enforcing entailment relations between a parent and its children in the hierarchy. Specifically, we introduce two novel loss functions: (1) Subject Loss, which aligns image content with the subject of the corresponding phrase, acting as an entailment of standard contrastive/matching losses at the Phrase level; (2) Composition Loss, which balances attention across multiple objects. HIST is general and can be applied to any VLM for which attention between vision and language can be computed. Compared to baseline VLMs, HIST achieves up to +9.8% improvement in visual grounding and +6.3% in multi-object referring segmentation. Surprisingly, the improved spatial grounding leads to improvements in other downstream VLM tasks: +1.1% in image-text retrieval, and +0.2% in visual question answering.
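To make the three-level hierarchy concrete, the following is a minimal sketch of how Subjects, Phrases, and Composite Phrases might be organized for one caption. It is an illustration under stated assumptions, not the authors' pipeline: the `Phrase` class, the `build_hierarchy` function, and the `" and "` joiner used to form composite phrases are all hypothetical, and the phrases and their subjects are supplied by hand rather than extracted by a parser.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Phrase:
    """Hypothetical container for one object-centric phrase."""
    text: str     # the full phrase, e.g. "a brown dog"
    subject: str  # the subject (head noun) of the phrase, e.g. "dog"


def build_hierarchy(phrases, joiner=" and "):
    """Group one caption's phrases into the three levels.

    `joiner` is an assumed way of combining two phrases into a
    composite phrase; the source does not specify the exact form.
    """
    subjects = [p.subject for p in phrases]
    phrase_texts = [p.text for p in phrases]
    # Every unordered pair of phrases yields one composite phrase.
    composites = [a.text + joiner + b.text for a, b in combinations(phrases, 2)]
    return {"subject": subjects, "phrase": phrase_texts, "composite": composites}


# Hand-written phrases standing in for the output of a caption parser.
caption_phrases = [
    Phrase("a brown dog", "dog"),
    Phrase("a red frisbee", "frisbee"),
]
hierarchy = build_hierarchy(caption_phrases)
```

Each level would then be paired with the image during training: subjects and phrases through matching-style losses, and composite phrases through the Composition Loss.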

Paper Structure

This paper contains 13 sections, 6 equations, 4 figures, 8 tables, 1 algorithm.

Figures (4)

  • Figure 1: Motivation. Existing VLMs are trained on unstructured image-caption pairs. The proposed HIerarchically STructured (HIST) learning framework decomposes captions into a hierarchy of phrases, establishing entailment between the phrases and the image, and among the phrases themselves, using the proposed losses. Specifically, it extracts subjects (Subject level) from phrases and aligns them to the image, along with the corresponding phrases themselves (Phrase level). In addition, it combines two phrases into a longer composite phrase and regularizes its attention to be the sum of the attention of the constituent sub-phrases (Composite Phrase level).
  • Figure 2: The overall structure of HIST. We decompose image captions into object-centric phrases and build a three-level hierarchy: Subject level, Phrase level, and Composite Phrase level. Entailment between these constituent components of the sentence and the image allows us to formulate additional regularization constraints for training VLMs. Specifically, we leverage three losses. At the Phrase level, we ensure alignment (entailment) between phrases and the image using the standard VLM losses typically applied to image-caption pairs: image-text contrastive (ITC), image-text matching (ITM), and masked language modeling (MLM) or language modeling (LM), depending on the model. At the Subject level, we similarly use ITC and ITM losses (SITC and SITM), but focus on matching the image and the subject of the phrase. This prevents adjectives in the phrase from obscuring the text-image alignment. The final Composition loss (L_COMP) requires the sum of the attention maps for the phrases to be as close as possible to the composite phrase attention. This loss encourages the model to attend to multiple objects concurrently, rather than disproportionately focusing on the most prominent one. To compute this loss, we use the product of the cross-attention map and Grad-CAM, illustrated for each layer and head.
  • Figure 3: Qualitative results for ALBEF, w/ SelfEQ, and w/ HIST on visual grounding and referring segmentation. Images are from RefCOCO+. The enlarged red stars mark the four locations with the highest predicted attention values for each method. To obtain the segmentation mask, we feed these four points to SAM as point prompts. HIST accurately detects all objects.
  • Figure 4: Qualitative Result for TinyLLaVA and TinyLLaVA + HIST on Visual Grounding. Images are from Flickr30K.
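
The Composition loss described in the Figure 2 caption can be sketched as follows. This is an illustrative approximation, not the paper's exact formulation: the function names, the normalization of each map to sum to 1, and the use of a mean squared difference as the distance are all assumptions. Attention maps are represented as flattened lists of non-negative values, one per image patch.

```python
def normalize(attn):
    """Scale a flattened, non-negative attention map so it sums to 1.

    Normalizing is an assumption made here so that maps of different
    overall magnitude can be compared.
    """
    total = sum(attn)
    if total == 0:
        return [0.0 for _ in attn]
    return [a / total for a in attn]


def composition_loss(composite_attn, phrase_attns):
    """Mean squared gap between the composite-phrase attention map and
    the sum of the attention maps of its constituent phrases.

    The squared-error distance is an assumed choice for illustration.
    """
    # Sum the constituent phrase maps patch by patch.
    summed = [sum(vals) for vals in zip(*phrase_attns)]
    target = normalize(summed)
    pred = normalize(composite_attn)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# Toy 4-patch image: phrase A attends to patch 0, phrase B to patch 3.
phrase_a = [1.0, 0.0, 0.0, 0.0]
phrase_b = [0.0, 0.0, 0.0, 1.0]

# A composite map that covers both objects equally matches the sum.
balanced = composition_loss([2.0, 0.0, 0.0, 2.0], [phrase_a, phrase_b])

# A composite map dominated by one object is penalized.
skewed = composition_loss([0.9, 0.0, 0.0, 0.1], [phrase_a, phrase_b])
```

In this toy setting the balanced composite map incurs zero loss, while the skewed map, which concentrates on the more prominent object, incurs a positive loss. That is the behavior the caption attributes to the Composition loss.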