Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models

Quang-Hung Le, Long Hoang Dang, Ngan Le, Truyen Tran, Thao Minh Le

TL;DR

PromViL addresses the challenge of grounded compositional reasoning in LVLMs by introducing progressive multi-granular vision-language alignments and a nested CompoVL dataset derived from Visual Genome. The method decomposes complex queries into nested expressions and learns grounding progressively from simple to complex levels, using a multi-level autoregressive training objective. Empirical results show PromViL outperforms baselines on visual grounding and compositional reasoning benchmarks, including zero-shot settings and out-of-distribution scenarios, with a small parameter footprint thanks to LoRA fine-tuning. The work offers a practical route to robust grounded reasoning in LVLMs and provides publicly available code and data for replication.

Abstract

Existing Large Vision-Language Models (LVLMs) excel at matching concepts across multi-modal inputs but struggle with compositional concepts and high-level relationships between entities. This paper introduces Progressive multi-granular Vision-Language alignments (PromViL), a novel framework to enhance LVLMs' ability to perform grounded compositional visual reasoning tasks. Our approach constructs a hierarchical structure of multi-modal alignments, ranging from simple to complex concepts. By progressively aligning textual descriptions with corresponding visual regions, our model learns to leverage contextual information from lower levels to inform higher-level reasoning. To facilitate this learning process, we introduce a data generation process that creates a novel dataset derived from Visual Genome, providing a wide range of nested compositional vision-language pairs. Experimental results demonstrate that our PromViL framework significantly outperforms baselines on various visual grounding and compositional question answering tasks. The code is available at: https://github.com/lqh52/PromViL.

Paper Structure

This paper contains 14 sections, 1 equation, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: Comparison with Existing LVLMs: (a) Coarse-grained: Whole image/region with full text, lacks object details. (b) Fine-grained: Simple phrases and bounding boxes, lacks relational context. (c) PromViL employs hierarchical multi-granular associations, progressively utilizing simpler concepts as cues to understand more complex ones.
  • Figure 2: Overview of our PromViL framework. (a) Training: Learn multi-level associations between visual entities and textual expressions. (b) Inference: Progressively prompt from simple to complex, using prior responses as clues. (c) Decomposition: Extract nested subsequences based on (i) constituency parsing (simplified for illustration) and (ii) dependency parsing.
  • Figure 3: Multi-granular Compositional V-L Data Generation. We create a novel dataset with rich, multi-granular V-L data using existing VG annotations.
  • Figure 4: Qualitative results on CompoVL-hard. Solid green: correct boxes, solid red: incorrect. Existing methods struggle with complex descriptions or multiple similar objects (e.g., two "women dressed in white"). PromViL leverages simpler expressions (dashed boxes) to accurately locate complex targets. More qualitative results in Appendix.
  • Figure 5: Accuracy comparison between PromViL and MiniGPTv2 on the CompoVL-hard dataset.
  • ...and 1 more figure
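
The inference procedure summarized in the Figure 2(b) caption (prompt from simple to complex, using prior responses as clues) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the box format, the prompt wording, and the toy grounding function are all hypothetical, and the nested expressions are written by hand here, whereas the paper derives them via constituency and dependency parsing.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2); the coordinate format is an assumption
Clue = Tuple[str, Box]           # a simpler expression paired with its predicted box


def progressive_grounding(
    expressions: List[str],
    ground: Callable[[str, List[Clue]], Box],
) -> List[Clue]:
    """Ground nested expressions from simplest to most complex.

    `expressions` must already be ordered simple -> complex. `ground` is a
    hypothetical stand-in for an LVLM call: it receives the current expression
    plus all earlier (expression, box) clues and returns one box.
    """
    clues: List[Clue] = []
    for expr in expressions:
        # Pass a copy so the callee cannot mutate the accumulated history.
        box = ground(expr, list(clues))
        clues.append((expr, box))
    return clues


def build_prompt(expr: str, clues: List[Clue]) -> str:
    """Format prior responses as textual clues (this prompt format is an assumption)."""
    lines = [f"{text} is at {box}" for text, box in clues]
    lines.append(f"Where is {expr}?")
    return "\n".join(lines)


def toy_ground(expr: str, clues: List[Clue]) -> Box:
    """Toy stand-in for a model: the box width grows with the number of clues seen."""
    n = len(clues)
    return (0, 0, 10 * (n + 1), 10)


# Hand-written nested expressions, ordered from simple to complex (illustrative only).
nested = [
    "a woman",
    "a woman dressed in white",
    "a woman dressed in white holding an umbrella",
]
result = progressive_grounding(nested, toy_ground)
```

In a real setting, `toy_ground` would be replaced by a call that sends `build_prompt(expr, clues)` and the image to a grounding-capable LVLM and parses the returned box. The key design point is that each level sees the responses of all simpler levels, so context accumulates as the expressions grow.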