MoColl: Agent-Based Specific and General Model Collaboration for Image Captioning

Pu Yang, Bin Dong

TL;DR

MoColl presents an agent-guided collaboration framework that unites a domain-specific VQA tool with a general LLM-based agent to tackle image captioning, with a focus on radiology report generation. The method decomposes captioning into a sequence of question-answer subtasks, where the VQA model handles domain visuals and the LLM agent plans, queries, and synthesizes captions, while also guiding VQA training through synthetic data generation and selection. A two-stage training procedure (warm-up and agent-guided tuning) and a three-step training loop (inference, selection, training) enable iterative improvement of the VQA tool, yielding state-of-the-art results on IU-Xray and MIMIC-CXR across standard captioning metrics. The work demonstrates the practical impact of integrating domain-specific visual analysis with comprehensive general knowledge, offering a pathway to more accurate, context-rich domain captions in complex multi-modal tasks.
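
The three-step loop (inference, selection, training) mentioned above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `tuning_round`, `run_inference`, and `train_vqa` are hypothetical, and the selection criterion used here (set-based unigram F1 against the reference caption, with a fixed threshold) is an assumption that is not taken from the source.

```python
from typing import Callable, List, Tuple

QAPair = Tuple[str, str]                 # (question, answer)
VQATriple = Tuple[str, str, str]         # (image id, question, answer)


def unigram_f1(candidate: str, reference: str) -> float:
    """Set-based unigram F1; an assumed stand-in for a caption-quality score."""
    cand = set(candidate.lower().split())
    ref = set(reference.lower().split())
    overlap = len(cand & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def tuning_round(
    dataset: List[Tuple[str, str]],      # (image id, reference caption)
    run_inference: Callable[[str], Tuple[List[QAPair], str]],
    train_vqa: Callable[[List[VQATriple]], None],
    threshold: float = 0.5,              # hypothetical cut-off
) -> List[str]:
    """Run one inference -> selection -> training round; return kept image ids."""
    references = dict(dataset)

    # Step 1, inference: the agent and the VQA tool produce a QA chain and a caption.
    samples = [(image_id, *run_inference(image_id)) for image_id, _ in dataset]

    # Step 2, selection: keep chains whose caption is close enough to the reference.
    selected = [
        (image_id, chain)
        for image_id, chain, caption in samples
        if unigram_f1(caption, references[image_id]) >= threshold
    ]

    # Step 3, training: the selected QA pairs become synthetic VQA training data.
    train_vqa([(image_id, q, a) for image_id, chain in selected for q, a in chain])
    return [image_id for image_id, _ in selected]
```

Repeating `tuning_round` with the updated VQA model gives the iterative improvement described in the TL;DR; how many rounds to run and how to score captions are design choices left open by this sketch.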

Abstract

Image captioning is a critical task at the intersection of computer vision and natural language processing, with wide-ranging applications across various domains. For complex tasks such as diagnostic report generation, deep learning models require not only domain-specific image-caption datasets but also the incorporation of relevant general knowledge to provide contextual accuracy. Existing approaches exhibit inherent limitations: specialized models excel in capturing domain-specific details but lack generalization, while vision-language models (VLMs) built on large language models (LLMs) leverage general knowledge but struggle with domain-specific adaptation. To address these limitations, this paper proposes a novel agent-enhanced model collaboration framework, which we call MoColl, designed to effectively integrate domain-specific and general knowledge. Specifically, our approach is to decompose complex image captioning tasks into a series of interconnected question-answer subtasks. A trainable visual question answering (VQA) model is employed as a specialized tool to focus on domain-specific visual analysis, answering task-specific questions based on image content. Concurrently, an LLM-based agent with general knowledge formulates these questions and synthesizes the resulting question-answer pairs into coherent captions. Beyond its role in leveraging the VQA model, the agent further guides its training to enhance its domain-specific capabilities. Experimental results on radiology report generation validate the effectiveness of the proposed framework, demonstrating significant improvements in the quality of generated reports.
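
The decomposition described in the abstract, where the agent formulates questions, the VQA model answers them from the image, and the agent synthesizes the question-answer pairs into a caption, can be illustrated with a small sketch. Everything below is hypothetical scaffolding: `CaptionAgent`, `toy_vqa`, and `generate_caption` are illustrative names, the agent follows a fixed question plan instead of prompting an LLM, and the "image" is a dictionary standing in for real visual input.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

QAPair = Tuple[str, str]  # (question, answer)


@dataclass
class CaptionAgent:
    """Stand-in for the LLM-based agent: asks questions, then writes the caption."""
    question_plan: List[str]

    def next_question(self, history: List[QAPair]) -> Optional[str]:
        # A real agent would condition on the history; this stub walks a fixed plan.
        if len(history) < len(self.question_plan):
            return self.question_plan[len(history)]
        return None  # nothing left to ask

    def synthesize(self, history: List[QAPair]) -> str:
        # A real agent would prompt an LLM with the QA pairs; this stub joins answers.
        return " ".join(answer for _, answer in history)


def toy_vqa(image: Dict[str, str], question: str) -> str:
    """Stand-in for the trainable VQA tool: looks the answer up in the toy image."""
    return image.get(question, "Not visible.")


def generate_caption(
    image: Dict[str, str],
    agent: CaptionAgent,
    vqa: Callable[[Dict[str, str], str], str],
    max_questions: int = 5,
) -> Tuple[str, List[QAPair]]:
    """Build a chain of QA pairs, then let the agent turn it into a caption."""
    history: List[QAPair] = []
    while len(history) < max_questions:
        question = agent.next_question(history)
        if question is None:
            break
        history.append((question, vqa(image, question)))
    return agent.synthesize(history), history
```

The `max_questions` cap plays the role of the limit on how many questions the agent may ask; the returned `history` is the question-answer chain that the caption was built from.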

Paper Structure

This paper contains 45 sections, 4 equations, 5 figures, 3 tables, and 2 algorithms.

Figures (5)

  • Figure 1: Illustrative figures of three frameworks for image captioning.
  • Figure 2: Illustration of the training procedure of our method.
  • Figure 3: Ablation study on in-context learning (ICL). We show the ROUGE-L score with respect to the number of few-shot examples for two different example-selection strategies. The left panel is for our MoColl framework with the aligned VLM, and the right panel is for the MoColl framework with our agent-guided tuning algorithm. The baseline is the best score among all competing methods.
  • Figure 4: Ablation study on the length of the question-answer chain. We show the ROUGE-L score (blue) and BLEU1 score (red) with respect to the maximum number of questions the agent is allowed to ask.
  • Figure 5: Ablation study on the data size. We show the ROUGE-L score (blue) and training loss (red) of our agent-guided tuning process with respect to the number of image-caption pairs. We also show the baseline ROUGE-L score of our model collaboration framework with the aligned VLM, which is trained through the warm-up stage on the whole training dataset. This baseline indicates the captioning performance at the initial fine-tuning state.