Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception

Yanpeng Sun, Jing Hao, Ke Zhu, Jiang-Jiang Liu, Yuxiang Zhao, Xiaofan Li, Gang Zhang, Zechao Li, Jingdong Wang

TL;DR

The paper presents Descriptive Caption Enhancement Engine (DCE), a framework that augments image captions for large multimodal models by extracting rich object attributes and relations using open-source visual specialists and fusing them with region and relational information via large language models. By constructing two large annotated datasets, DCE-1M and DCE-118K, the approach demonstrates improved visual-language alignment and stronger performance across diverse VQA and multimodal benchmarks, while offering a cost-effective alternative to proprietary captioning pipelines. The work highlights the value of combining specialized visual information (fine-grained categories, depth, OCR, HOI, and 3D/2D spatial relations) with LLM-based caption synthesis, achieving richer, more contextually aware descriptions that translate into better downstream reasoning. It also discusses limitations such as OCR accuracy and detector noise, and points to future directions including multilingual captioning and broader integration of visual expertise.

Abstract

Training Large Multimodality Models (LMMs) relies on descriptive image captions that connect image and language. Existing methods either distill captions from LMMs or construct them from internet images or by human annotation. We propose to leverage off-the-shelf visual specialists, which were originally trained on annotated images for tasks other than image captioning, to enhance the image caption. Our approach, named DCE, explores low-level and fine-grained object attributes (e.g., depth, emotion, and fine-grained categories) and object relations (e.g., relative location and human-object interaction (HOI)), and combines the attributes into the descriptive caption. Experiments demonstrate that such visual specialists improve performance on visual understanding tasks as well as on reasoning that benefits from more accurate visual understanding. We will release the source code and the pipeline so that other visual specialists can be easily combined into the pipeline. The complete source code of the DCE pipeline and the datasets will be available at https://github.com/syp2ysy/DCE.

Paper Structure

This paper contains 12 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: (a) A comparison of captions from DCE, human annotators, and generalist LMMs, including InternVL2-26B, LLaVA-NeXT, and GPT-4V. (b) A visualization of the extent to which the captions in (a) describe multiple objects and various attributes, including Objects 1-8, Object Attributes, OCR, HOI, 2D spatial relations, and 3D spatial relations.
  • Figure 2: Comparisons of caption quality. (a) and (b) show the downstream task performance of LLaVA-v1.5 and LLaVA-NeXT after pretraining with different image captions.
  • Figure 3: The DCE pipeline first utilizes various visual specialists to extract both Object and Relation attributes. Then, it uses an LLM to integrate the object attributes into detailed region captions, followed by combining the region captions with relational attributes to generate a comprehensive image caption.
  • Figure 4: The prompt for using an LLM to generate a region caption by considering object attributes and reference captions.
  • Figure 5: The prompt for using an LLM to generate an image caption by considering relation attributes, region location information, and region captions.
  • ...and 1 more figures
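
The two-stage flow described in the Figure 3 caption (object attributes fused into region captions, then region captions combined with relation attributes into one image caption) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Region` class, the `LLM` callable type, the function names, and the prompt wording are all hypothetical, and the actual prompts are the ones shown in Figures 4 and 5 of the paper. The visual specialists are assumed to have already produced the attribute and relation strings.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical LLM interface: takes a prompt string, returns generated text.
LLM = Callable[[str], str]


@dataclass
class Region:
    """One detected object; all fields are assumed specialist outputs."""
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2)
    attributes: Dict[str, str] = field(default_factory=dict)  # e.g. depth, OCR


def region_caption(region: Region, llm: LLM) -> str:
    """Stage 1: fuse one region's object attributes into a region caption."""
    facts = "; ".join(f"{k}: {v}" for k, v in sorted(region.attributes.items()))
    prompt = f"Describe the object at {region.box} using these attributes: {facts}"
    return llm(prompt)


def image_caption(regions: List[Region], relations: List[str], llm: LLM) -> str:
    """Stage 2: combine region captions with relation attributes."""
    lines = ["Write one detailed image caption.", "Regions:"]
    for region in regions:
        lines.append(f"- {region.box}: {region_caption(region, llm)}")
    lines.append("Relations:")
    lines.extend(f"- {rel}" for rel in relations)
    return llm("\n".join(lines))


if __name__ == "__main__":
    # A stand-in "LLM" that echoes its prompt, so the data flow is visible.
    echo: LLM = lambda prompt: prompt
    dog = Region((0, 0, 10, 10), {"category": "golden retriever", "depth": "near"})
    print(image_caption([dog], ["dog is left of the bench"], echo))
```

Passing the LLM as a plain callable keeps the sketch independent of any particular model or API; swapping in a real model only requires a function with the same prompt-in, text-out shape.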