Chain-of-Description: What I can understand, I can put into words

Jiaxin Guo, Daimeng Wei, Zongyao Li, Hengchao Shang, Yuanchang Luo, Hao Yang

TL;DR

This work introduces Chain-of-Description (CoD) prompting for multi-modal large language models, advocating a two-stage process where models first generate a detailed textual description of the input before answering. Across both Large Audio-Language Models (LALMs) and Large Vision-Language Models (LVLMs), CoD improves reasoning and alignment to ground-truth answers, with notable gains in hard or high-density information scenarios. Evaluations on AIR-Bench-Chat (audio) and MMMU_Pro (vision) show consistent performance improvements, including nearly 4% in the speech category and 5.3% on hard-level visual tasks, with ablation analyses highlighting the role of information density and the benefits of higher-quality descriptions. The findings suggest that explicitly describing inputs can deepen model understanding and yield practically meaningful gains, while acknowledging the need for extensive multi-modal pretraining to maximize benefits. The approach offers a principled direction for enhancing multi-modal reasoning in open-source LALMs and LVLMs, with implications for future research and benchmarking.

Abstract

In this paper, we propose a novel strategy defined as Chain-of-Description (CoD) Prompting, tailored for Multi-Modal Large Language Models. This approach involves having the model first provide a detailed description of the multi-modal input before generating an answer to the question. When applied to models such as Qwen2-Audio, Qwen2-VL, and Qwen2.5-VL, CoD Prompting significantly enhances performance compared to standard prompting methods. This is demonstrated by nearly a 4% improvement in the speech category of the audio benchmark AIR-Bench-Chat and a 5.3% improvement in the hard-level portion of the vision benchmark MMMU_Pro. Our ablation study further validates the effectiveness of CoD Prompting.
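
The contrast between standard prompting and CoD prompting described in the abstract can be illustrated with a minimal sketch. The function names and the exact instruction wording below are assumptions made for illustration; they are not the prompts used in the paper, and the resulting string would still need to be passed, together with the audio or image input, to a model such as Qwen2-Audio or Qwen2-VL.

```python
def standard_prompt(question: str) -> str:
    """Standard prompting: the question is sent to the model as-is."""
    return question.strip()


def cod_prompt(question: str, modality: str = "image") -> str:
    """Chain-of-Description-style prompt (hypothetical wording).

    Asks the model to describe the multi-modal input in detail first,
    and only then answer the question.
    """
    return (
        f"First, describe the {modality} in detail. "
        "Then, based on your description, answer the question.\n"
        f"Question: {question.strip()}"
    )


if __name__ == "__main__":
    q = "What is the speaker's mood?"
    print(standard_prompt(q))
    print(cod_prompt(q, modality="audio"))
```

In this sketch the description request is placed before the question, so that the model's answer can draw on its own description; whether the two steps are issued in a single prompt or as separate calls is left open here.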

Paper Structure

This paper contains 24 sections, 3 figures, 9 tables.

Figures (3)

  • Figure 1: An example of using Standard Prompting and our Chain-of-Description (CoD) Prompting for Large Audio-Language Models (LALMs).
  • Figure 2: An example of using Standard Prompting and our Chain-of-Description (CoD) Prompting for Large Vision-Language Models (LVLMs).
  • Figure 3: A case image.