Causal Graphical Models for Vision-Language Compositional Understanding
Fiorenzo Parascandolo, Nicholas Moratelli, Enver Sangineto, Lorenzo Baraldi, Rita Cucchiara
TL;DR
This paper addresses the limited compositional understanding of Vision-Language Models (VLMs) by modeling dependencies among textual and visual tokens with a Causal Graphical Model (CGM) derived from a dependency tree. It introduces COGT, a decoder trained with semi-parallel, CGM-guided generation, in which the joint caption distribution factorizes as $P(W_1, ..., W_n | S_1, ..., S_n, Z_1, ..., Z_m) = \prod_{j=1}^n P(W_j | PA(W_j))$, where $PA(W_j)$ denotes the parents of $W_j$ in the CGM. This factorization enforces a sparse, causally meaningful structure. A mapping network projects visual features from the VLM, and the decoder uses 45 syntactic-category-specific masked tokens plus a visible token per word, employing Dependency Guided Attention to predict each word conditioned on its CGM parents and the visual context. Empirically, COGT achieves state-of-the-art results on five compositional benchmarks across multiple backbones (CLIP, XVLM, InstructBLIP), often with less training data, and ablation studies confirm the critical role of the parser, the mask-specific tokens, and the multi-layer visual features. The work demonstrates that enforcing causal structure in vision-language understanding improves generalization to fine-grained compositional tasks and offers a practical path toward robust, data-efficient VLMs.
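To make the idea of attending only to CGM parents more concrete, the following is a minimal sketch of how a sparse attention mask could be derived from a dependency tree. It is an illustration, not the authors' implementation: the function name `dependency_attention_mask`, the `heads` encoding (index of each word's head, `-1` for the root), and the simplifying assumption that a word's parents are its head word plus all visual tokens are hypothetical choices made here.

```python
import numpy as np

def dependency_attention_mask(heads, num_visual):
    """Build a boolean attention mask from a dependency tree.

    heads[j] is the index of the head of word j, or -1 for the root
    (hypothetical encoding). Columns are laid out as
    [visual tokens | word tokens]. Entry (j, k) is True when the
    prediction of word j is allowed to attend to token k.
    """
    n = len(heads)
    mask = np.zeros((n, num_visual + n), dtype=bool)
    # Assumption: every word is conditioned on all visual tokens.
    mask[:, :num_visual] = True
    for j, h in enumerate(heads):
        if h >= 0:
            # Assumption: the only textual parent of a word is its head.
            mask[j, num_visual + h] = True
    return mask

# Toy parse of "a dog chases a ball" (hand-written, not parser output):
# a -> dog, dog -> chases, chases -> ROOT, a -> ball, ball -> chases
heads = [1, 2, -1, 4, 2]
mask = dependency_attention_mask(heads, num_visual=3)
```

In this toy layout the root word ("chases") attends only to the visual tokens, while every other word additionally attends to exactly one textual token, its head. This sparsity is what distinguishes the mask from the lower-triangular mask of a standard autoregressive decoder.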
Abstract
Recent work has empirically shown that Vision-Language Models (VLMs) struggle to fully understand the compositional properties of human language, usually modeling an image caption as a "bag of words". As a result, they perform poorly on compositional tasks, which require a deeper understanding of the different entities of a sentence (subject, verb, etc.) together with their mutual relationships. In this paper, we model the dependency relations among textual and visual tokens using a Causal Graphical Model (CGM), built using a dependency parser, and we train a decoder conditioned on the VLM visual encoder. Unlike standard autoregressive or parallel prediction, our decoder's generative process is partially ordered, following the CGM structure. This structure encourages the decoder to learn only the main causal dependencies in a sentence, discarding spurious correlations. Through extensive experiments on five compositional benchmarks, we show that our method outperforms all state-of-the-art compositional approaches by a large margin, and that it also improves over methods trained on much larger datasets.
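As a rough illustration of what a partially ordered generative process means, the sketch below groups the words of a sentence by their depth in a dependency tree, so that words at the same depth have no ordering between them and could in principle be predicted in parallel. The function name `generation_levels` and the `heads` encoding are assumptions introduced here for illustration. The paper's actual CGM also involves visual variables and may order predictions differently.

```python
def generation_levels(heads):
    """Group word indices by depth in a dependency tree.

    heads[j] is the index of the head of word j, or -1 for the root
    (hypothetical encoding). Returns a list of levels, where level 0
    holds the root and every word appears one level after its head.
    """
    depth = {}

    def word_depth(j):
        # Memoized depth: the root has depth 0, others are head depth + 1.
        if j not in depth:
            depth[j] = 0 if heads[j] < 0 else word_depth(heads[j]) + 1
        return depth[j]

    levels = {}
    for j in range(len(heads)):
        levels.setdefault(word_depth(j), []).append(j)
    return [levels[d] for d in sorted(levels)]

# Toy parse of "a dog chases a ball" (hand-written, not parser output).
heads = [1, 2, -1, 4, 2]
levels = generation_levels(heads)
print(levels)  # [[2], [1, 4], [0, 3]]
```

Here "chases" comes first, "dog" and "ball" depend only on "chases" and are unordered with respect to each other, and the two determiners come last. A fully autoregressive decoder would instead impose a single left-to-right chain over all five words.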
