Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions

Renjie Pi, Jianshu Zhang, Jipeng Zhang, Rui Pan, Zhekai Chen, Tong Zhang

TL;DR

This work tackles the limitation of existing image descriptions being either noisy from web data or shallow from human labeling. It introduces Image Textualization, a three-phase framework that combines multimodal large language models, vision expert models, and LLM-based recaptioning to produce accurate, richly detailed captions with reduced hallucinations. The authors establish three benchmarks (DID-Bench, D2I-Bench, LIN-Bench) and a large IT-generated dataset (IT-170K) to evaluate quality, completeness, and linguistic aspects, and demonstrate that IT-generated data enhances MLLM tuning and downstream performance. The results show substantial improvements over baseline MLLMs in descriptive quality, image-consistency, and reduced hallucinations, signaling practical gains for image understanding, generation, and retrieval tasks, with open releases to foster future research.

Abstract

Image description datasets play a crucial role in the advancement of various applications such as image understanding, text-to-image generation, and text-image retrieval. Currently, image description datasets primarily originate from two sources. One source is the scraping of image-text pairs from the web. Despite their abundance, these descriptions are often noisy and of low quality. The other is human labeling. Descriptions in datasets such as COCO are generally very short and lack detail. Although detailed image descriptions can be annotated by humans, the high annotation cost limits feasibility. These limitations underscore the need for more efficient and scalable methods to generate accurate and detailed image descriptions. In this paper, we propose an innovative framework termed Image Textualization (IT), which automatically produces high-quality image descriptions by leveraging existing multi-modal large language models (MLLMs) and multiple vision expert models in a collaborative manner, maximally converting the visual information into text. To address the current lack of benchmarks for detailed descriptions, we propose several benchmarks for comprehensive evaluation, which verify the quality of image descriptions created by our framework. Furthermore, we show that LLaVA-7B, benefiting from training on IT-curated descriptions, acquires an improved capability to generate richer image descriptions, substantially increasing the length and detail of its output with less hallucination.

Paper Structure (34 sections, 8 figures, 13 tables, 2 algorithms)

Figures (8)

  • Figure 1: Visualization of our Image Textualization. Compared with the MLLM-generated description, our description incorporates more visual details and significantly fewer hallucinations. The shared details, newly added details, hallucinations, and positional descriptions are all marked with different colors.
  • Figure 2: The framework of Image Textualization (IT), which consists of three phases: (A) Holistic Textualization utilizes an MLLM to generate a "Reference Description" that provides a basic structure; (B) Visual Detail Textualization identifies hallucinations and captures details in the image via a variety of vision experts, then transforms them into text; (C) Textualized Recaptioning leverages an LLM and the textualized results from (A) and (B) to regenerate image captions that are both rich in detail and free from hallucination.
  • Figure 3: D2I-Bench visualization. IT-generated descriptions capture more fine-grained image details, which leads to generated images more similar to the original images.
  • Figure 4: D2I-Bench Results.
  • Figure 5: Comparison with results generated without using fine-grained annotation and in-context examples. We
  • ...and 3 more figures
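
The three phases named in the Figure 2 caption can be sketched as a small pipeline. This is only an illustrative outline, not the authors' implementation: every function name, the `VisualDetail` schema, and the toy stubs are hypothetical, and treating "objects mentioned in the reference description but not confirmed by any vision expert" as hallucinations is an assumption made here to keep the example concrete.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass(frozen=True)
class VisualDetail:
    """One vision-expert finding (hypothetical schema, not from the paper)."""
    label: str
    position: str


def textualize(details: List[VisualDetail]) -> List[str]:
    """Turn expert outputs into plain-text facts an LLM can read."""
    return [f"{d.label} at {d.position}" for d in details]


def image_textualization(
    image: object,
    describe: Callable[[object], str],
    mentioned_objects: Callable[[str], Set[str]],
    vision_experts: Callable[[object], List[VisualDetail]],
    recaption: Callable[[str, List[str], List[str]], str],
) -> str:
    # Phase A, Holistic Textualization: an MLLM writes the reference description.
    reference = describe(image)

    # Phase B, Visual Detail Textualization: vision experts supply details.
    details = vision_experts(image)
    confirmed = {d.label for d in details}
    # Assumption: anything the reference mentions that no expert confirms
    # is flagged as a hallucination.
    hallucinated = sorted(mentioned_objects(reference) - confirmed)

    # Phase C, Textualized Recaptioning: an LLM merges everything.
    return recaption(reference, textualize(details), hallucinated)


# Toy stand-ins for the MLLM, the vision experts, and the LLM.
def toy_describe(image: object) -> str:
    return "A dog sits next to a frisbee on the grass."


def toy_mentions(text: str) -> Set[str]:
    return {w for w in ("dog", "frisbee", "grass") if w in text}


def toy_experts(image: object) -> List[VisualDetail]:
    return [
        VisualDetail("dog", "center"),
        VisualDetail("grass", "bottom"),
        VisualDetail("ball", "bottom left"),
    ]


def toy_recaption(reference: str, facts: List[str], hallucinated: List[str]) -> str:
    # A real system would prompt an LLM here; this stub only shows the inputs.
    return f"keep: {reference} | add: {'; '.join(facts)} | drop: {', '.join(hallucinated)}"


if __name__ == "__main__":
    print(image_textualization(None, toy_describe, toy_mentions, toy_experts, toy_recaption))
```

In the toy run, the reference description mentions a frisbee that no expert confirms, so it is passed to the recaptioning step as something to drop, while the ball found only by the experts is passed as a detail to add.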