From Simple to Professional: A Combinatorial Controllable Image Captioning Agent
Xinran Wang, Muxi Diao, Baoteng Li, Haiwen Zhang, Kongming Liang, Zhanyu Ma
TL;DR
This work tackles the challenge of users providing only simple prompts for image captioning and presents CapAgent, a two-stage framework that first evolves these prompts into context-aware professional instructions using external web context and constraint specifications, then employs an agent with planning, retrieval, and a tool suite to generate compliant captions. The core idea is to convert user intent into a formalized instruction $s^\nabla = A(s, c, x)$ and to execute captioning through a Thought-Action-Observation loop augmented by Retrieval-Augmented Planning and a set of specialized tools (including Visual Question Answering, sentiment editing, and length management). Key contributions include the instruction-evolving module with context integration, the CapAgent architecture with a reusable toolchain, and demonstrations of combinatorial caption control that meet format, semantic, lexical, and utility constraints. The approach enables precise, sentiment-aligned, keyword-rich, and structurally compliant captions, with potential impact on accessibility, content creation, and caption quality control in multimodal systems.
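The two stages summarized above can be illustrated with a minimal Python sketch. It is not the CapAgent implementation: every name in it (`Instruction`, `evolve_instruction`, `add_keywords`, `limit_length`, `plan`, `run_agent`) is hypothetical, the reading of $A(s, c, x)$ as taking a simple prompt $s$, a context $c$, and constraints $x$ is an assumption, and a rule-based planner stands in for the MLLM. Retrieval-Augmented Planning, Visual Question Answering, and sentiment editing are omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Instruction:
    """Hypothetical container for an evolved (professional) instruction."""
    text: str
    max_words: int
    keywords: List[str] = field(default_factory=list)


def evolve_instruction(s: str, c: str, x: Dict[str, object]) -> Instruction:
    """Toy stand-in for A(s, c, x): merge prompt s, context c, constraints x."""
    keywords = list(x.get("keywords", []))
    max_words = int(x.get("max_words", 20))
    text = (f"{s.strip()} Context: {c.strip()} "
            f"Include: {', '.join(keywords)}. Limit: {max_words} words.")
    return Instruction(text, max_words, keywords)


def add_keywords(caption: str, inst: Instruction) -> str:
    """Toy lexical tool: append any required keyword that is missing."""
    missing = [k for k in inst.keywords if k.lower() not in caption.lower()]
    if not missing:
        return caption
    return caption.rstrip(".") + " with " + " and ".join(missing) + "."


def limit_length(caption: str, inst: Instruction) -> str:
    """Toy length tool: truncate the caption to the word budget."""
    words = caption.split()
    if len(words) <= inst.max_words:
        return caption
    return " ".join(words[:inst.max_words]).rstrip(".,") + "."


TOOLS: Dict[str, Callable[[str, Instruction], str]] = {
    "add_keywords": add_keywords,
    "limit_length": limit_length,
}


def plan(caption: str, inst: Instruction) -> Tuple[str, Optional[str]]:
    """Rule-based planner (an assumption; the paper uses an MLLM here)."""
    if any(k.lower() not in caption.lower() for k in inst.keywords):
        return "required keywords are missing", "add_keywords"
    if len(caption.split()) > inst.max_words:
        return "caption exceeds the word limit", "limit_length"
    return "all constraints satisfied", None


def run_agent(draft: str, inst: Instruction, max_steps: int = 5):
    """Thought-Action-Observation loop over a draft caption."""
    caption = draft
    trace: List[Tuple[str, str, str]] = []
    for _ in range(max_steps):  # step cap guards against tools undoing each other
        thought, action = plan(caption, inst)           # Thought
        if action is None:
            trace.append((thought, "finish", caption))
            break
        caption = TOOLS[action](caption, inst)          # Action
        trace.append((thought, action, caption))        # Observation
    return caption, trace
```

In this sketch the draft caption is a fixed string supplied by the caller, whereas a real system would obtain it from a captioning model. The returned `trace` records each thought, the tool chosen, and the resulting caption, which mirrors the step-by-step transparency the paper emphasizes.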
Abstract
The Controllable Image Captioning Agent (CapAgent) is a system designed to bridge the gap between simple user input and professional-level output in image captioning tasks. CapAgent automatically transforms simple user-provided instructions into detailed, professional instructions, enabling precise and context-aware caption generation. By leveraging multimodal large language models (MLLMs) and external tools such as an object detection tool and search engines, the system ensures that captions adhere to specified guidelines, including sentiment, keywords, focus, and formatting. CapAgent transparently controls each step of the captioning process and exposes its reasoning and tool usage at every step, fostering user trust and engagement. The project code is available at https://github.com/xin-ran-w/CapAgent.
