
BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions

Anas Awadalla, Le Xue, Manli Shu, An Yan, Jun Wang, Senthil Purushwalkam, Sheng Shen, Hannah Lee, Oscar Lo, Jae Sung Park, Etash Guha, Silvio Savarese, Ludwig Schmidt, Yejin Choi, Caiming Xiong, Ran Xu

TL;DR

KALE tackles the gap between descriptive synthetic captions and factual web alt-text by creating knowledge-augmented dense captions through a two-stage pipeline. It first generates initial captions with CogVLM-17B and enriches them with Mistral, then trains a distilled LLaVA-like VLM to scale to 218M image-text pairs. The resulting KALE dataset yields improved downstream performance across diverse vision-language benchmarks compared to baselines like Datacomp and LAION-COCO, demonstrating the value of knowledge grounding in multimodal pretraining. The work also emphasizes efficiency via model distillation to enable large-scale data generation, with plans to extend to billions of examples and broader tasks.

Abstract

We introduce BLIP3-KALE, a dataset of 218 million image-text pairs that bridges the gap between descriptive synthetic captions and factual web-scale alt-text. KALE augments synthetic dense image captions with web-scale alt-text to generate factually grounded image captions. Our two-stage approach leverages large vision-language models and language models to create knowledge-augmented captions, which are then used to train a specialized VLM for scaling up the dataset. We train vision-language models on KALE and demonstrate improvements on vision-language tasks. Our experiments show the utility of KALE for training more capable and knowledgeable multimodal models. We release the KALE dataset at https://huggingface.co/datasets/Salesforce/blip3-kale

Paper Structure

This paper contains 11 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overview of KALE dataset creation and performance. Top: Example showing how KALE combines web alt-text with synthetic captions to produce knowledge-rich dense captions. Bottom left: Two-stage generation pipeline for KALE, using CogVLM and Mistral to create an initial set of knowledge-augmented captions, followed by training a distilled VLM to scale up to 218M samples. Bottom right: Evaluation results comparing KALE's average performance against popular synthetic image-text datasets.
  • Figure 2: We generate KALE in a two-stage process. Stage 1: We first create an initial pool of 100M knowledge-augmented dense captions by using CogVLM-17B to generate dense captions for Datacomp-1B images and then augmenting these captions with real-world knowledge by prompting Mistral. Stage 2: We use the knowledge-augmented captions from Stage 1 to train a VLM that takes image patch embeddings and Datacomp-1B captions as inputs and outputs knowledge-augmented captions. This VLM is then used to efficiently caption 118M more images from Datacomp-1B.
  • Figure 3: Example of pipeline artifacts in a caption. The highlighted text in red shows phrases that have leaked from the system prompt into the final output.
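
The two-stage process described in Figure 2 can be sketched as a small data flow. This is a minimal illustration only, not the authors' implementation: the `Sample` type, the `stage1` and `stage2` functions, and the three `fake_*` callables are hypothetical stand-ins for CogVLM-17B, Mistral, and the distilled VLM, and the training of the distilled VLM on Stage 1 output is not shown. The only figures taken from the source are the 100M and 118M sample counts.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    image_id: str
    alt_text: str      # web alt-text (the Datacomp-1B caption)
    caption: str = ""  # knowledge-augmented dense caption, filled in by a stage


def stage1(samples: List[Sample],
           dense_captioner: Callable[[str], str],
           augmenter: Callable[[str, str], str]) -> List[Sample]:
    """Dense-caption each image, then merge in knowledge from the alt-text."""
    out = []
    for s in samples:
        dense = dense_captioner(s.image_id)  # role played by CogVLM-17B
        merged = augmenter(dense, s.alt_text)  # role played by Mistral
        out.append(Sample(s.image_id, s.alt_text, merged))
    return out


def stage2(samples: List[Sample],
           distilled_vlm: Callable[[str, str], str]) -> List[Sample]:
    """Caption in one pass with a model trained on Stage 1 output."""
    return [Sample(s.image_id, s.alt_text, distilled_vlm(s.image_id, s.alt_text))
            for s in samples]


# Hypothetical placeholder models (assumptions, not real model calls).
def fake_dense_captioner(image_id: str) -> str:
    return f"dense description of {image_id}"


def fake_augmenter(dense: str, alt_text: str) -> str:
    return f"{dense}; grounded in: {alt_text}"


def fake_distilled_vlm(image_id: str, alt_text: str) -> str:
    return f"dense description of {image_id}; grounded in: {alt_text}"


pool = [Sample(f"img{i}", f"alt text {i}") for i in range(5)]
stage1_out = stage1(pool[:2], fake_dense_captioner, fake_augmenter)
stage2_out = stage2(pool[2:], fake_distilled_vlm)
kale = stage1_out + stage2_out

# Sample counts from the figure caption, in millions.
STAGE1_M, STAGE2_M = 100, 118
TOTAL_M = STAGE1_M + STAGE2_M  # 218, matching the stated dataset size
```

In this toy version, Stage 2 replaces two model calls per image with one, which mirrors the efficiency argument the source makes for distillation.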