LLM-Driven Completeness and Consistency Evaluation for Cultural Heritage Data Augmentation in Cross-Modal Retrieval

Jian Zhang, Junyi Guo, Junyi Yuan, Huanda Lu, Yanlin Zhou, Fangyu Wu, Qiufeng Wang, Dongming Lu

TL;DR

The paper tackles the problem of incomplete and potentially hallucinated textual descriptions in cultural heritage cross-modal retrieval. It introduces $C^3$, an LLM-driven augmentation framework that enforces completeness via bidirectional coverage and consistency via a Chain-of-Thought sequence guided by a Markov Decision Process (MDP), followed by contrastive learning with augmented captions. Key contributions include a formal completeness score S_complete, a CoT-based augmentation pipeline (C1–C4) with MDP supervision, and strong retrieval improvements on CulTi and TimeTravel, as well as competitive zero-shot results on MSCOCO and Flickr30K. The work demonstrates that grounding augmented descriptions in visual evidence and controlling reasoning steps can substantially reduce hallucinations while boosting multimodal alignment, with implications for digital preservation and museum analytics.

Abstract

Cross-modal retrieval is essential for interpreting cultural heritage data, but its effectiveness is often limited by incomplete or inconsistent textual descriptions, caused by historical data loss and the high cost of expert annotation. While large language models (LLMs) offer a promising solution by enriching textual descriptions, their outputs frequently suffer from hallucinations or miss visually grounded details. To address these challenges, we propose $C^3$, a data augmentation framework that enhances cross-modal retrieval performance by improving the completeness and consistency of LLM-generated descriptions. $C^3$ introduces a completeness evaluation module to assess semantic coverage using both visual cues and language-model outputs. Furthermore, to mitigate factual inconsistencies, we formulate a Markov Decision Process to supervise Chain-of-Thought reasoning, guiding consistency evaluation through adaptive query control. Experiments on the cultural heritage datasets CulTi and TimeTravel, as well as on general benchmarks MSCOCO and Flickr30K, demonstrate that $C^3$ achieves state-of-the-art performance in both fine-tuned and zero-shot settings.
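To make the idea of coverage-based completeness concrete, here is a minimal sketch of a two-directional coverage check between attributes detected in an image and attributes mentioned in a caption. It is an illustration only and not the paper's S_complete: the function name `bidirectional_coverage`, the exact-match comparison of attribute strings, and the harmonic-mean combination are all assumptions made for this example.

```python
from typing import Iterable, Set


def _normalize(attributes: Iterable[str]) -> Set[str]:
    """Lower-case and strip attribute strings so matching is case-insensitive."""
    return {a.strip().lower() for a in attributes if a.strip()}


def bidirectional_coverage(visual_attrs: Iterable[str],
                           caption_attrs: Iterable[str]) -> float:
    """Toy completeness proxy: harmonic mean of two coverage directions.

    NOTE: hypothetical helper for illustration, not the paper's S_complete.
    """
    visual = _normalize(visual_attrs)
    caption = _normalize(caption_attrs)
    if not visual or not caption:
        return 0.0
    shared = visual & caption
    # Share of visually detected attributes that the caption mentions
    # (a low value suggests missing details).
    visual_to_text = len(shared) / len(visual)
    # Share of caption attributes that are grounded in visual evidence
    # (a low value suggests possible hallucination).
    text_to_visual = len(shared) / len(caption)
    if visual_to_text + text_to_visual == 0:
        return 0.0
    return 2 * visual_to_text * text_to_visual / (visual_to_text + text_to_visual)


# Made-up attributes for a bronze vessel; "gold inlay" has no visual support.
visual = ["bronze", "tripod", "taotie pattern", "handles"]
caption = ["Bronze", "tripod", "gold inlay"]
score = bidirectional_coverage(visual, caption)
```

In this toy setting the two directions separate the two failure modes the abstract names: the caption omits two of the four visual attributes, and one of its three attributes is not supported by the image. A real system would replace exact string matching with a learned similarity between visual and textual features.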

Paper Structure

This paper contains 33 sections, 11 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Illustration of our $C^3$ compared with general retrieval methods and LLM-based caption augmentation methods.
  • Figure 2: Overview of the proposed $C^3$ framework. The pipeline first verifies image-text attributes extracted by an LLM with bidirectional attention and coverage scoring, then augments captions through a CoT framework and a consistency evaluation process. The detailed captions are then used to fine-tune a CLIP-based retrieval model for improved image-text alignment.
  • Figure 3: Case studies on completeness and consistency evaluation. The example captions and annotations are translated from Chinese into English for readability. Red denotes inaccurate or hallucinated descriptions; green denotes details missing from the original caption.
  • Figure 4: (a) Retrieval performance on the CulTi dataset under both zero-shot and fine-tuned settings; (b) Comparison of zero-shot retrieval performance across different VLLMs on CulTi.