COCO-Tree: Compositional Hierarchical Concept Trees for Enhanced Reasoning in Vision Language Models

Sanchit Sinha, Guangzhi Xiong, Aidong Zhang

TL;DR

COCO-Tree addresses core compositionality gaps in vision-language models by introducing hierarchical concept trees derived from an LLM and performing beam-search-inspired path exploration to identify reasoning pathways. A neurosymbolic reasoning module (System-2) is fused with the base VLM (System-1) via a weighted combination, with interpretable reasoning along the selected path. Empirical results on four benchmarks (Winoground, EqBench, ColorSwap, SugarCrepe) across seven open-source VLMs show consistent improvements of about 5–10% in compositional generalization, with ablations clarifying the roles of tree depth, branching, and the balancing hyperparameters $\alpha$ and $\beta$. The approach offers a resource-efficient alternative to large LLMs while enhancing interpretability, though it introduces potential hallucination risks and computational overhead from multi-stage reasoning.

Abstract

Compositional reasoning remains a persistent weakness of modern vision language models (VLMs): they often falter when a task hinges on understanding how multiple objects, attributes, and relations interact within an image. Prior work has tried to improve compositional performance through techniques such as improved prompt structure and chain-of-thought reasoning. A more recent line of work imparts additional reasoning to VLMs using well-trained Large Language Models (LLMs), which are far superior to VLMs in linguistic understanding, to compensate for the VLMs' limited linguistic prowess. However, these approaches are either resource-intensive or do not provide an interpretable reasoning process. In this paper, we present COCO-Tree, a novel approach that augments VLM outputs with carefully designed neurosymbolic concept trees learned from LLMs to improve the VLM's linguistic reasoning. COCO-Tree's beam-search-inspired reasoning process boosts compositionality performance and provides a rationale behind VLM predictions. Empirical results on four compositionality benchmarks (Winoground, EqBench, ColorSwap, and SugarCrepe) across seven open-source VLMs of varying sizes demonstrate that COCO-Tree significantly improves compositional generalization by 5-10% over baselines.
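As a rough illustration of the two ideas named above (beam-search-style selection of a reasoning path through a concept tree, and a weighted combination of the VLM output with the score of that path), the following is a minimal sketch under stated assumptions, not the authors' implementation. The `ConceptNode` class, the use of a mean node score as the path score, the `fuse` function, and its weight `w` are all hypothetical. In particular, `w` is not the paper's $\alpha$ or $\beta$, whose exact definitions are not given in this summary, and the node scores are made-up numbers.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ConceptNode:
    """Hypothetical tree node: a concept plus an illustrative score in [0, 1]."""
    text: str
    score: float
    children: list[ConceptNode] = field(default_factory=list)


def mean_score(path: list[ConceptNode]) -> float:
    """Assumed path score: the average of the node scores along the path."""
    return sum(node.score for node in path) / len(path)


def beam_search_paths(root: ConceptNode, beam_width: int = 2) -> list[list[ConceptNode]]:
    """Expand the tree level by level, keeping the beam_width best partial paths.

    Returns the root-to-leaf paths that survived pruning, best first.
    """
    beams = [[root]]
    finished = []
    while beams:
        candidates = []
        for path in beams:
            children = path[-1].children
            if not children:
                finished.append(path)  # reached a leaf: the path is complete
            else:
                candidates.extend(path + [child] for child in children)
        # Prune: keep only the highest-scoring partial paths for the next level.
        candidates.sort(key=mean_score, reverse=True)
        beams = candidates[:beam_width]
    finished.sort(key=mean_score, reverse=True)
    return finished


def fuse(vlm_score: float, path_score: float, w: float = 0.5) -> float:
    """Assumed convex combination of the VLM score and the reasoning-path score."""
    return (1.0 - w) * vlm_score + w * path_score


# Toy tree with made-up scores, for illustration only.
root = ConceptNode("caption", 0.9, [
    ConceptNode("a", 0.8, [ConceptNode("a1", 0.7), ConceptNode("a2", 0.2)]),
    ConceptNode("b", 0.3, [ConceptNode("b1", 0.9)]),
])

best_path = beam_search_paths(root, beam_width=2)[0]
best_texts = [node.text for node in best_path]
fused_score = fuse(vlm_score=0.4, path_score=mean_score(best_path), w=0.5)
print(" -> ".join(best_texts), round(fused_score, 3))
```

The selected path doubles as the rationale: reading off the concepts along `best_path` gives a human-readable trace of why the fused score moved away from the raw VLM score. A narrower beam is cheaper but can discard a branch whose strong leaf concept would have outscored the retained ones.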

Paper Structure

This paper contains 24 sections, 12 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: An example of the compositionality problem from the Winoground dataset. The VLM successfully identifies the presence of a bird and a snake in the image but fails to understand the relation between them.
  • Figure 2: A schematic of the major components of our proposed approach. (a) Semantic Morphological Decomposition, which decomposes a caption into morphological entities to disentangle structure and semantics. (b) Recursive Concept Exploration, in which new concepts are discovered. (c) Dynamic Path Selection and the implied neurosymbolic reasoning pathways. The numbers represent the composite scores and the green arrows represent the selected reasoning path.
  • Figure 3: Ablation study on the impact of composite score hyperparameters $\alpha$ and $\beta$. The color gradient represents accuracy, with deep yellow being the maximum and deep purple the minimum. Top: LLaVA-1.5-7b, Bottom: InstructBLIP-XXL.
  • Figure 4: The reasoning pathway for two randomly chosen test samples from the Winoground dataset using LLaVA-1.5-7b. Prediction scores represent the reasoning path probability of a positive and a negative sample.
  • Figure 5: Prompt template used to generate morphological entities for function $F_{SMD}$ using an LLM.
  • ...and 2 more figures