Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception

Ruotian Peng, Haiying He, Yake Wei, Yandong Wen, Di Hu

TL;DR

This paper tackles the problem that multimodal large language models (MLLMs) often generate captions that lack fine-grained detail and are prone to hallucinations. It introduces a training-free divide-then-aggregate pipeline that splits images into spatial and semantic patches, generates patch-level descriptions, and hierarchically aggregates them with semantic filtering to produce a detailed and reliable global caption. The approach uses lightweight visual experts (OVDet, BLIPv2) and an LLM (LLaMA-3.1) to enable patch-based perception without retraining. It applies to both open-source and closed-source MLLMs and is evaluated on benchmarks such as DID-Bench, D2I-Bench, and DetailCaps, with substantial gains in metrics like CIDEr and METEOR and reduced hallucinations. Overall, Patch Matters offers a scalable, training-free path to richer, more faithful image captions that benefit downstream multimodal tasks while avoiding model retraining costs.

Abstract

High-quality image captions play a crucial role in improving the performance of cross-modal applications such as text-to-image generation, text-to-video generation, and text-image retrieval. To generate long-form, high-quality captions, many recent studies have employed multimodal large language models (MLLMs). However, current MLLMs often produce captions that lack fine-grained details or suffer from hallucinations, a challenge that persists in both open-source and closed-source models. Inspired by Feature-Integration theory, which suggests that attention must focus on specific regions to integrate visual information effectively, we propose a divide-then-aggregate strategy. Our method first divides the image into semantic and spatial patches to extract fine-grained details, enhancing the model's local perception of the image. These local details are then hierarchically aggregated to generate a comprehensive global description. To address hallucinations and inconsistencies in the generated captions, we apply a semantic-level filtering process during hierarchical aggregation. This training-free pipeline can be applied to both open-source models (LLaVA-1.5, LLaVA-1.6, Mini-Gemini) and closed-source models (Claude-3.5-Sonnet, GPT-4o, GLM-4V-Plus). Extensive experiments demonstrate that our method generates more detailed, reliable captions, advancing multimodal description generation without requiring model retraining. The source code is available at https://github.com/GeWu-Lab/Patch-Matters.

Paper Structure

This paper contains 29 sections, 7 equations, 12 figures, 17 tables.

Figures (12)

  • Figure 1: An illustration of current challenges faced by MLLMs in generating accurate image descriptions. Our approach provides a description with enhanced visual details and significantly reduced hallucinated content. Shared information, newly added details and hallucinations are highlighted in different colors for clarity.
  • Figure 2: Our training-free divide-then-aggregate pipeline. (A) Overview: We divide the image into spatial and semantic patches to capture finer details (Sec. 3.2, Patch Slicing), and perform hierarchical aggregation (intra-patch and inter-patch) to generate detailed and reliable image captions (Sec. 3.3, Hierarchical Aggregation). (B) Semantic Filtering: During aggregation, candidate descriptions are classified into same, contradictory, and unique categories, which are then consolidated into a coherent caption to mitigate hallucination and prevent information conflict. (C) Inter-patch Aggregation: When fusing descriptions from different patches, we use IoU to determine whether semantic patch enhancement is required to incorporate global information, and whether Semantic Filtering is needed to prevent conflicts.
  • Figure 3: The visualization of D2I-Bench shows that our method is able to capture more image details, as well as object attributes and relationships, resulting in generated images that are more similar to the original images.
  • Figure 4: The performance of different methods and models on the VQA task shows that our method achieves the best results.
  • Figure 5: The system and user prompts used for semantic filtering query.
  • ...and 7 more figures
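
The Figure 2 caption mentions using IoU (intersection over union) between patches to steer inter-patch aggregation. As a rough illustration of that quantity only, the sketch below computes the IoU of two axis-aligned boxes. The box format `(x_min, y_min, x_max, y_max)`, the helper name `patches_overlap`, and the `0.5` threshold are assumptions made for this example and are not taken from the paper or its repository; how the paper maps IoU values to aggregation decisions is not stated in this summary.

```python
from typing import Tuple

# Assumed box format for this sketch: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Return the intersection over union of two axis-aligned boxes."""
    # Corners of the intersection rectangle (may be empty).
    ix_min = max(a[0], b[0])
    iy_min = max(a[1], b[1])
    ix_max = min(a[2], b[2])
    iy_max = min(a[3], b[3])

    # Clamp to zero so disjoint boxes yield no intersection area.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) boxes.
    return inter / union if union > 0 else 0.0


def patches_overlap(a: Box, b: Box, threshold: float = 0.5) -> bool:
    """Hypothetical helper: True when the IoU reaches an assumed threshold."""
    return iou(a, b) >= threshold


if __name__ == "__main__":
    left_half = (0.0, 0.0, 50.0, 100.0)
    semantic_patch = (25.0, 0.0, 75.0, 100.0)
    print(iou(left_half, semantic_patch))
    print(patches_overlap(left_half, semantic_patch))
```

A pipeline in this style could compute the score for each pair of patches and branch on it; the threshold would need tuning against whatever criterion the actual method uses.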