MoColl: Agent-Based Specific and General Model Collaboration for Image Captioning
Pu Yang, Bin Dong
TL;DR
MoColl is an agent-guided collaboration framework that pairs a domain-specific visual question answering (VQA) tool with a general LLM-based agent for image captioning, with a focus on radiology report generation. The method decomposes captioning into a sequence of question-answer subtasks: the VQA model analyzes domain-specific image content, while the LLM agent plans the questions, queries the VQA model, and synthesizes the answers into a caption. The agent also guides VQA training by generating and selecting synthetic data. A two-stage training procedure (warm-up and agent-guided tuning) and a three-step training loop (inference, selection, training) iteratively improve the VQA tool, yielding state-of-the-art results on IU-Xray and MIMIC-CXR across standard captioning metrics. The work shows the practical value of combining domain-specific visual analysis with broad general knowledge, offering a path to more accurate, context-rich captions in complex multimodal domains.
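The three-step loop (inference, selection, training) can be pictured with a minimal sketch. This is an illustration under assumptions, not the authors' implementation: the function `agent_guided_round`, the callables `run_inference`, `keep`, and `train_step`, and the toy selection rule are all hypothetical names introduced here, since the summary does not specify how candidates are generated or filtered.

```python
from typing import Callable, Iterable, List, Tuple

# (image_id, question, answer); a hypothetical sample layout for illustration.
Sample = Tuple[str, str, str]


def agent_guided_round(
    image_ids: Iterable[str],
    run_inference: Callable[[str], List[Tuple[str, str]]],
    keep: Callable[[Sample], bool],
    train_step: Callable[[List[Sample]], None],
) -> List[Sample]:
    """Run one inference -> selection -> training round and return the kept samples."""
    # 1) Inference: collect candidate question-answer pairs for every image.
    candidates = [(img, q, a) for img in image_ids for q, a in run_inference(img)]
    # 2) Selection: keep only the candidates the (assumed) filter accepts.
    selected = [s for s in candidates if keep(s)]
    # 3) Training: update the VQA tool on the selected samples, if any.
    if selected:
        train_step(selected)
    return selected


# Toy stand-ins so the sketch runs without any model (all hypothetical).
def toy_inference(image_id: str) -> List[Tuple[str, str]]:
    return [("Is there a fracture?", "no"), ("Any opacity?", "")]


def toy_keep(sample: Sample) -> bool:
    # Assumed rule: discard samples with an empty answer.
    return sample[2] != ""


training_buffer: List[Sample] = []
selected = agent_guided_round(
    ["img0", "img1"], toy_inference, toy_keep, training_buffer.extend
)
```

In a real system, `run_inference` would wrap the agent and VQA model together and `train_step` would run a fine-tuning update. Here `train_step` only appends to a list so the control flow can be checked.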
Abstract
Image captioning is a critical task at the intersection of computer vision and natural language processing, with applications across many domains. For complex tasks such as diagnostic report generation, deep learning models require not only domain-specific image-caption datasets but also relevant general knowledge to ensure contextual accuracy. Existing approaches have inherent limitations: specialized models excel at capturing domain-specific details but generalize poorly, while vision-language models (VLMs) built on large language models (LLMs) leverage general knowledge but struggle with domain-specific adaptation. To address these limitations, this paper proposes a novel agent-enhanced model collaboration framework, which we call MoColl, designed to effectively integrate domain-specific and general knowledge. Specifically, MoColl decomposes complex image captioning tasks into a series of interconnected question-answer subtasks. A trainable visual question answering (VQA) model serves as a specialized tool for domain-specific visual analysis, answering task-specific questions based on image content. Concurrently, an LLM-based agent with general knowledge formulates these questions and synthesizes the resulting question-answer pairs into coherent captions. Beyond using the VQA model as a tool, the agent also guides the VQA model's training to strengthen its domain-specific capabilities. Experimental results on radiology report generation validate the effectiveness of the proposed framework, demonstrating significant improvements in the quality of generated reports.
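The decomposition into question-answer subtasks can be illustrated with a minimal sketch. It is written under assumptions and is not the paper's implementation: `generate_caption`, the callables `ask`, `vqa`, and `summarize`, and the `max_turns` limit are hypothetical names chosen for this example, and the toy functions stand in for the LLM agent and the VQA model.

```python
from typing import Callable, List, Optional, Tuple

QAPair = Tuple[str, str]


def generate_caption(
    image: object,
    ask: Callable[[List[QAPair]], Optional[str]],
    vqa: Callable[[object, str], str],
    summarize: Callable[[List[QAPair]], str],
    max_turns: int = 5,
) -> Tuple[str, List[QAPair]]:
    """Run a question-answer loop, then synthesize a caption from the pairs."""
    history: List[QAPair] = []
    for _ in range(max_turns):
        # The agent proposes the next question, or None when it has enough.
        question = ask(history)
        if question is None:
            break
        # The specialized tool answers the question from the image content.
        answer = vqa(image, question)
        history.append((question, answer))
    # The agent turns the collected pairs into a single caption.
    return summarize(history), history


# Toy stand-ins (hypothetical) so the sketch runs without any model.
_QUESTIONS = ["Is the heart size normal?", "Is there pleural effusion?"]


def toy_ask(history: List[QAPair]) -> Optional[str]:
    return _QUESTIONS[len(history)] if len(history) < len(_QUESTIONS) else None


def toy_vqa(image: dict, question: str) -> str:
    return image.get(question, "unclear")


def toy_summarize(history: List[QAPair]) -> str:
    return " ".join(f"{q} {a}." for q, a in history)


toy_image = {"Is the heart size normal?": "yes", "Is there pleural effusion?": "no"}
caption, pairs = generate_caption(toy_image, toy_ask, toy_vqa, toy_summarize)
```

Passing the agent and the tool as plain callables keeps the loop independent of any particular LLM or VQA library. The dictionary "image" is only a placeholder that lets the toy VQA function look up canned answers.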
