BACON: Improving Clarity of Image Captions via Bag-of-Concept Graphs

Zhantao Yang, Ruili Feng, Keyu Yan, Huangji Wang, Zhicai Wang, Shangwen Zhu, Han Zhang, Jie Xiao, Pingyu Wu, Kai Zhu, Jixuan Chen, Chen-Wei Xie, Yue Yang, Hongyang Zhang, Yu Liu, Fan Cheng

TL;DR

This work tackles the problem of entangled and overly long captions produced by Vision-Language Models, which hinder downstream reasoning in non-LLM-enabled systems. It introduces BACON, a prompting framework that decomposes captions into a Bag-of-Concept Graph with an overall description, an object list, and explicit relationships, encoded in a fixed VLM-friendly string format and grounded via Grounding DINO and CLIP/LLaVA filters. The authors create the ECO dataset (~103k images) and train a specialized captioner, LLaVA(BACON)-Captioner, achieving superior caption quality as measured by CQA, OV-SGG, and user studies, while enabling non-LLM models to perform tasks such as open-vocabulary detection, zero-shot region QA, dense video captioning, and improved image generation without additional training. Overall, BACON enhances multi-modal understanding and generation by providing clear, structured captions that transfer effectively across diverse models and tasks, paving the way for broader applicability of structured captions in real-world systems.

Abstract

Advancements in large Vision-Language Models (VLMs) have brought precise, accurate image captioning, which is vital for multi-modal image understanding and processing. Yet these captions often carry lengthy, intertwined contexts that are difficult to parse and frequently overlook essential cues. This poses a significant barrier for models like GroundingDINO and SDXL, which lack the strong text encoding and syntax analysis needed to fully leverage dense captions. To address this, we propose BACON, a prompting method that breaks down VLM-generated captions into disentangled, structured elements such as objects, relationships, styles, and themes. This approach not only minimizes confusion from handling complex contexts but also allows for efficient transfer into a JSON dictionary, enabling models without linguistic processing capabilities to easily access key information. We annotated 100,000 image-caption pairs using BACON with GPT-4V and trained a LLaVA captioner on this dataset, enabling it to produce BACON-style captions without relying on costly GPT-4V. Evaluations of overall quality, precision, and recall, as well as user studies, demonstrate that the resulting caption model consistently outperforms other SOTA VLMs in generating high-quality captions. In addition, we show that BACON-style captions exhibit better clarity when applied to various models, enabling them to accomplish previously unattainable tasks or surpass existing SOTA solutions without training. For example, BACON-style captions help GroundingDINO achieve 1.51x higher recall scores on open-vocabulary object detection tasks compared to leading methods.
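To make the idea of a structured caption that transfers into a JSON dictionary concrete, the following is a minimal sketch of how such a caption might be represented. The class names, field names, and the example content are all assumptions chosen for illustration; they are not the schema or string format used by the authors.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ConceptObject:
    """One entry of the object list (field names are hypothetical)."""
    name: str
    category: str
    description: str
    color: str


@dataclass
class Relationship:
    """A subject-predicate-object triple linking two listed objects."""
    subject: str
    predicate: str
    object: str


@dataclass
class BaconCaption:
    """A structured caption: overall description plus disentangled elements."""
    overall_description: str
    style: str
    theme: str
    objects: list[ConceptObject] = field(default_factory=list)
    relationships: list[Relationship] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict recurses into nested dataclasses, giving a plain dictionary.
        return json.dumps(asdict(self), indent=2)

    def object_names(self) -> list[str]:
        # A downstream model without an LLM can read names directly,
        # with no need to parse a long free-form sentence.
        return [obj.name for obj in self.objects]


# Illustrative content only; not taken from the paper's dataset.
caption = BaconCaption(
    overall_description="A dog chases a ball on a lawn.",
    style="photograph",
    theme="outdoor play",
    objects=[
        ConceptObject("dog", "animal", "a small dog running", "brown"),
        ConceptObject("ball", "toy", "a rubber ball in mid-air", "red"),
    ],
    relationships=[Relationship("dog", "chases", "ball")],
)

print(caption.to_json())
```

The point of the sketch is the access pattern: a detector can be handed `caption.object_names()` one entry at a time, while a text-to-image model can consume the overall description and object descriptions separately.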

Paper Structure (39 sections, 1 equation, 23 figures, 7 tables)

Figures (23)

  • Figure 1: The BACON-style captions consist of three components: an overall description, an object list, and relationships. Each object in the object list is accompanied by its category information, detailed description, and color information.
  • Figure 2: (a) Prompts targeting basic elements produce more pronounced attention maps (particularly in crimson) in the target region. (b) Statistical analysis shows the prompts focusing on basic elements have higher average attention values in the target area, indicating an enhanced understanding of VLMs. (c) Prompts targeting basic elements lead to far more consistent answers.
  • Figure 3: (a) Detailed method for obtaining bounding boxes for BACON-style captions: 1) Extract BACON-style captions without bounding boxes from images using GPT-4V or LLaVA(BACON)-Captioner; 2) Generate candidate regions using Grounding DINO given the object name; 3) Use LLaVA to delete clearly incorrect regions; 4) Use CLIP to select the region that best matches the object description. (b) Ablation study conducted on ECO dataset, showing the improvement of introducing CLIP and LLaVA.
  • Figure 4: CQA for evaluating caption quality, where a fixed QA model answers image-related questions based on the caption of the image instead of the image itself.
  • Figure 5: Win rate of pairwise comparisons between popular VLM-based captioners and LLaVA(BACON)-Captioner.
  • ...and 18 more figures