BACON: Improving Clarity of Image Captions via Bag-of-Concept Graphs
Zhantao Yang, Ruili Feng, Keyu Yan, Huangji Wang, Zhicai Wang, Shangwen Zhu, Han Zhang, Jie Xiao, Pingyu Wu, Kai Zhu, Jixuan Chen, Chen-Wei Xie, Yue Yang, Hongyang Zhang, Yu Liu, Fan Cheng
TL;DR
This work tackles the problem of entangled and overly long captions produced by Vision-Language Models, which hinder downstream use by models that lack LLM-level text understanding. It introduces BACON, a prompting framework that decomposes captions into a Bag-of-Concept Graph with an overall description, an object list, and explicit relationships, encoded in a fixed VLM-friendly string format and grounded via Grounding DINO and CLIP/LLaVA filters. The authors create the ECO dataset (~103k images) and train a specialized captioner, LLaVA(BACON)-Captioner, which achieves superior caption quality as measured by CQA, OV-SGG, and user studies, while enabling non-LLM models to perform tasks such as open-vocabulary detection, zero-shot region QA, dense video captioning, and improved image generation without additional training. Overall, BACON improves multi-modal understanding and generation by providing clear, structured captions that transfer across diverse models and tasks.
Abstract
Advancements in large Vision-Language Models have brought precise, accurate image captioning, vital for advancing multi-modal image understanding and processing. Yet these captions often carry lengthy, intertwined contexts that are difficult to parse and frequently overlook essential cues, posing a great barrier for models like GroundingDINO and SDXL, which lack the strong text encoding and syntax analysis needed to fully leverage dense captions. To address this, we propose BACON, a prompting method that breaks down VLM-generated captions into disentangled, structured elements such as objects, relationships, styles, and themes. This approach not only minimizes confusion from handling complex contexts but also allows for efficient transfer into a JSON dictionary, enabling models without linguistic processing capabilities to easily access key information. We annotated 100,000 image-caption pairs using BACON with GPT-4V and trained a LLaVA captioner on this dataset, enabling it to produce BACON-style captions without relying on costly GPT-4V. Evaluations of overall quality, precision, and recall, as well as user studies, demonstrate that the resulting caption model consistently outperforms other SOTA VLMs in generating high-quality captions. We also show that BACON-style captions exhibit better clarity when applied to various models, enabling them to accomplish previously unattainable tasks or surpass existing SOTA solutions without training. For example, BACON-style captions help GroundingDINO achieve 1.51x higher recall scores on open-vocabulary object detection tasks compared to leading methods.
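
To make the idea of a disentangled, JSON-transferable caption concrete, here is a minimal Python sketch. It is not the BACON format itself, which this page does not specify: the class names (ConceptObject, Relationship, StructuredCaption), the field names, and the example caption are all hypothetical, chosen only to mirror the elements named above (an overall description, an object list, and explicit relationships). The " . " separator in detector_prompt follows the category-prompt convention commonly used with Grounding DINO and is likewise an assumption, not something stated by the authors.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class ConceptObject:
    """One entry in the object list (hypothetical schema)."""
    name: str
    description: str = ""


@dataclass
class Relationship:
    """An explicit subject-predicate-object triple (hypothetical schema)."""
    subject: str
    predicate: str
    target: str


@dataclass
class StructuredCaption:
    """A structured caption: overall description, objects, relationships."""
    overall: str
    objects: List[ConceptObject] = field(default_factory=list)
    relationships: List[Relationship] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recursively converts nested dataclasses into plain dicts.
        return json.dumps(asdict(self), indent=2)

    def object_names(self) -> List[str]:
        return [obj.name for obj in self.objects]

    def detector_prompt(self) -> str:
        # Assumed convention: category names separated by " . ".
        return " . ".join(self.object_names()) + " ."

    def dangling_relationships(self) -> List[Relationship]:
        # Relationships that mention a name missing from the object list.
        known = set(self.object_names())
        return [r for r in self.relationships
                if r.subject not in known or r.target not in known]


# Illustrative data only; not taken from the paper or its dataset.
caption = StructuredCaption(
    overall="A dog leaps to catch a frisbee in a sunny park.",
    objects=[
        ConceptObject("dog", "brown, mid-air, mouth open"),
        ConceptObject("frisbee", "red plastic disc"),
    ],
    relationships=[Relationship("dog", "catching", "frisbee")],
)

print(caption.to_json())
print(caption.detector_prompt())
```

A downstream model with no language-parsing ability can then read the object names or triples directly from the dictionary instead of extracting them from a long free-form sentence, which is the benefit the abstract attributes to the structured form.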
