Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception

Yanpeng Sun; Jing Hao; Ke Zhu; Jiang-Jiang Liu; Yuxiang Zhao; Xiaofan Li; Gang Zhang; Zechao Li; Jingdong Wang

TL;DR

The paper presents the Descriptive Caption Enhancement Engine (DCE), a framework that enriches image captions for large multimodal models. It extracts object attributes and relations with open-source visual specialists, then uses large language models to fuse them with region and relational information. The authors construct two large annotated datasets, DCE-1M and DCE-118K, and report improved visual-language alignment and stronger performance across diverse VQA and multimodal benchmarks, while offering a cost-effective alternative to proprietary captioning pipelines. The work highlights the value of combining specialized visual information (fine-grained categories, depth, OCR, HOI, and 3D/2D spatial relations) with LLM-based caption synthesis, which yields richer, more contextually aware descriptions that translate into better downstream reasoning. It also discusses limitations such as OCR accuracy and detector noise, and points to future directions including multilingual captioning and broader integration of visual expertise.

Abstract

Training Large Multimodality Models (LMMs) relies on descriptive image captions that connect image and language. Existing methods either distill captions from LMMs or construct them from internet images or human annotation. We propose to leverage off-the-shelf visual specialists, which were originally trained on annotated images for tasks other than image captioning, to enhance image captions. Our approach, named DCE, explores low-level and fine-grained object attributes (e.g., depth, emotion, and fine-grained categories) and object relations (e.g., relative location and human-object interaction (HOI)), and combines this information into the descriptive caption. Experiments demonstrate that such visual specialists improve performance on visual understanding tasks as well as on reasoning tasks that benefit from more accurate visual understanding. We will release the source code and the pipeline so that other visual specialists can be easily added. The complete source code of the DCE pipeline and the datasets will be available at https://github.com/syp2ysy/DCE.
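The fusion idea in the abstract can be illustrated with a minimal sketch. It assumes the visual specialists have already produced per-object attributes and pairwise relations, and it only assembles the text prompt that an LLM could be asked to rewrite into a caption. The names `RegionFacts`, `Relation`, and `build_fusion_prompt`, the prompt wording, and the sample values are all hypothetical and are not taken from the DCE code base.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class RegionFacts:
    """Hypothetical container for specialist outputs about one object."""
    label: str                                  # fine-grained category
    box: Tuple[int, int, int, int]              # (x1, y1, x2, y2) in pixels
    attributes: Dict[str, str] = field(default_factory=dict)  # e.g. depth, emotion


@dataclass
class Relation:
    """Hypothetical pairwise relation; subject and obj index into the region list."""
    subject: int
    predicate: str                              # e.g. "left of", or an HOI verb
    obj: int


def build_fusion_prompt(base_caption: str,
                        regions: List[RegionFacts],
                        relations: List[Relation]) -> str:
    """Assemble a plain-text prompt listing every specialist fact."""
    lines = [
        "Rewrite the caption so that it covers every fact listed below.",
        f"Original caption: {base_caption}",
        "Objects:",
    ]
    for i, region in enumerate(regions):
        # Sort attribute keys so the prompt is deterministic.
        attrs = ", ".join(f"{k}={v}" for k, v in sorted(region.attributes.items()))
        suffix = f" ({attrs})" if attrs else ""
        lines.append(f"  [{i}] {region.label} at {region.box}{suffix}")
    lines.append("Relations:")
    for rel in relations:
        subj = regions[rel.subject].label
        obj = regions[rel.obj].label
        lines.append(f"  {subj} {rel.predicate} {obj}")
    return "\n".join(lines)


# Illustrative values only; a real pipeline would fill these from detectors.
regions = [
    RegionFacts("golden retriever", (40, 120, 310, 400), {"depth": "near"}),
    RegionFacts("child", (300, 80, 520, 410), {"emotion": "happy", "depth": "near"}),
]
relations = [Relation(1, "petting", 0), Relation(0, "left of", 1)]
prompt = build_fusion_prompt("A child and a dog in a park.", regions, relations)
print(prompt)
```

Keeping the specialist outputs in a structured form before rendering them as text means a new specialist only has to add keys to `attributes` or entries to `relations`, which is one plausible way to make a pipeline of this kind easy to extend.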
