Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation

Yunhao Ge, Xiaohui Zeng, Jacob Samuel Huffman, Tsung-Yi Lin, Ming-Yu Liu, Yin Cui

TL;DR

VisualFactChecker (VFC) tackles hallucination and insufficient detail in automatic visual captioning with a training-free pipeline that chains open-source captioners under an LLM-driven verifier equipped with grounding tools. The three-stage process (propose, verify, and caption) applies to both 2D images and 3D objects, and a new CLIP-Image-Score metric assesses caption quality by how closely an image reconstructed from the caption matches the original. On COCO and Objaverse, VFC outperforms state-of-the-art open-source captioning methods and attains captioning capability comparable to proprietary models such as GPT-4V, despite being over 10x smaller in model size. The results suggest that composing open-source tools through LLM reasoning can deliver high-fidelity, instruction-following captions, as supported by human, GPT-4V, and reconstruction-based evaluations.

Abstract

Existing automatic captioning methods for visual content face challenges such as lack of detail, content hallucination, and poor instruction following. In this work, we propose VisualFactChecker (VFC), a flexible training-free pipeline that generates high-fidelity and detailed captions for both 2D images and 3D objects. VFC consists of three steps: 1) proposal, where image-to-text captioning models propose multiple initial captions; 2) verification, where a large language model (LLM) uses tools such as object detection and VQA models to fact-check the proposed captions; 3) captioning, where an LLM generates the final caption by summarizing the caption proposals and the fact-check results. In this last step, VFC can flexibly generate captions in various styles following complex instructions. We conduct comprehensive captioning evaluations using four metrics: 1) CLIP-Score for image-text similarity; 2) CLIP-Image-Score for measuring the image-image similarity between the original image and the image reconstructed from the caption by a text-to-image model; 3) a human study on Amazon Mechanical Turk; 4) GPT-4V for fine-grained evaluation. Evaluation results show that VFC outperforms state-of-the-art open-source captioning methods for 2D images on the COCO dataset and 3D assets on the Objaverse dataset. Our study demonstrates that by combining open-source models into a pipeline, we can attain captioning capability comparable to proprietary models such as GPT-4V, despite being over 10x smaller in model size.

Paper Structure (20 sections, 14 figures, 4 tables)

Figures (14)

  • Figure 1: Comparison of VisualFactChecker (VFC) with GPT-4V and Cap3D. VFC can generate high-fidelity detailed captions that closely match GPT-4V's quality for 2D images and offer significantly more details for 3D objects than Cap3D. VFC used a pre-trained Llama-2 as the LLM when generating the caption for the above 2D image.
  • Figure 2: We use DALLE-3 as a text-to-image model to reconstruct 2D images using generated captions from different captioning methods (BLIP-2, LLaVA-1.5 and ours). Similarly, we use MVDream as a text-to-3D model to reconstruct 3D objects using different 3D captions (generated by Cap3D and ours). The images and 3D objects reconstructed from BLIP-2 or Cap3D captions are less similar to the inputs, suggesting that these captions either lack sufficient information or describe the visual content incorrectly. The images reconstructed from LLaVA-1.5 captions contain objects or scenes that are not present in the original images (top: people in the background; bottom: pedestrians and cars on the street), suggesting that the LLaVA-1.5 captions may contain hallucinations. Images and 3D objects reconstructed using our captions are more similar to the inputs.
  • Figure 3: Pipeline of the VisualFactChecker for captioning 2D images (top) and 3D objects (bottom). The process begins with the input being captioned by two multimodal captioning models (Captioner-1 and Captioner-2) to generate preliminary captions. These captions are then verified using a Large Language Model (LLM) to call object detection (Detector) and VQA models for fact-checking the captions. Finally, the LLM incorporates all the results and summarizes the final caption by following instructions.
  • Figure 4: The CLIP-Image-Score pipeline evaluates caption accuracy by encoding an original image $X$ into a feature representation $I_X$ using a CLIP image encoder. A captioning model generates a caption, which is fed to a text-to-image model to reconstruct an image $X'$; this image is encoded to $I_{X'}$. The score is the cosine similarity between $I_X$ and $I_{X'}$, which measures the caption's fidelity and is sensitive to hallucinated content.
  • Figure 5: 2D image captioning comparison with pair-wise winning rate. VisualFactChecker (VFC) outperforms all baseline methods on both CLIP-Score (top) and CLIP-Image-Score (bottom).
  • ...and 9 more figures
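
The CLIP-Image-Score of Figure 4 and the pair-wise winning rate of Figure 5 can be sketched in a few lines. This is an illustrative sketch, not the authors' code: `encode_image` and `text_to_image` are hypothetical callables standing in for a CLIP image encoder and a text-to-image model, embeddings are plain lists of floats, and counting a tie as "no win" is an assumption, since the tie rule is not stated here.

```python
import math
from typing import Callable, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def clip_image_score(
    image: object,
    caption: str,
    encode_image: Callable[[object], Sequence[float]],  # placeholder for a CLIP image encoder
    text_to_image: Callable[[str], object],             # placeholder for a text-to-image model
) -> float:
    """Similarity between the original image X and the reconstruction X'."""
    reconstructed = text_to_image(caption)   # X' generated from the caption
    i_x = encode_image(image)                # I_X
    i_x_prime = encode_image(reconstructed)  # I_{X'}
    return cosine_similarity(i_x, i_x_prime)


def pairwise_winning_rate(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """Fraction of samples where method A scores strictly higher than method B.

    Ties count as no win for A (an assumption made for this sketch).
    """
    if len(scores_a) != len(scores_b) or len(scores_a) == 0:
        raise ValueError("score lists must be non-empty and aligned per sample")
    wins = sum(1 for a, b in zip(scores_a, scores_b) if a > b)
    return wins / len(scores_a)
```

Because the score compares images with images, a caption that invents content tends to lower it: the invented content appears in the reconstruction $X'$ and pulls $I_{X'}$ away from $I_X$.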