From Simple to Professional: A Combinatorial Controllable Image Captioning Agent

Xinran Wang, Muxi Diao, Baoteng Li, Haiwen Zhang, Kongming Liang, Zhanyu Ma

TL;DR

This work tackles the challenge of users providing only simple prompts for image captioning and presents CapAgent, a two-stage framework that first evolves these prompts into context-aware professional instructions using external web context and constraint specifications, then employs an agent with planning, retrieval, and a tool suite to generate compliant captions. The core idea is to convert user intent into a formalized instruction $s^{\nabla} = A(s, c, x)$ and to execute captioning through a Thought-Action-Observation loop augmented by Retrieval-Augmented Planning and a set of specialized tools (including Visual Question Answering, sentiment editing, and length management). Key contributions include the instruction-evolving module with context integration, the CapAgent architecture with a reusable toolchain, and demonstrations of combinatorial caption control that meet format, semantic, lexical, and utility constraints. The approach enables precise, sentiment-aligned, keyword-rich, and structurally compliant captions, with potential impact on accessibility, content creation, and caption quality control in multimodal systems.
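
The two stages described above can be illustrated with a toy sketch. It is not the CapAgent implementation: a rule-based planner stands in for the MLLM, and the names `evolve_instruction`, `add_keyword`, `shorten`, and `run_agent` are hypothetical, chosen only for this example. The sketch assumes just two constraints (required keywords and a word limit) and omits retrieval and visual tools entirely.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    """One Thought-Action-Observation record in the agent trace."""
    thought: str
    action: str
    observation: str


def evolve_instruction(s: str, c: str, x: str) -> str:
    """Toy stand-in for A(s, c, x): merge the simple instruction s,
    the context c, and the constraints x into one instruction string."""
    return f"{s.strip()} Context: {c.strip()} Constraints: {x.strip()}"


def add_keyword(caption: str, keyword: str) -> str:
    """Hypothetical lexical tool: append a missing keyword to the caption."""
    return f"{caption.rstrip('.')}, featuring {keyword}."


def shorten(caption: str, max_words: int) -> str:
    """Hypothetical length tool: truncate the caption to max_words words."""
    return " ".join(caption.split()[:max_words])


def run_agent(draft: str, keywords: List[str], max_words: int,
              max_steps: int = 5) -> Tuple[str, List[Step]]:
    """Repeat Thought -> Action -> Observation until every constraint
    holds or the step budget is exhausted."""
    caption = draft
    trace: List[Step] = []
    for _ in range(max_steps):
        missing = [k for k in keywords if k.lower() not in caption.lower()]
        if missing:
            # Thought: a lexical constraint is violated; Action: add_keyword.
            thought = f"keyword '{missing[0]}' is missing"
            action = "add_keyword"
            caption = add_keyword(caption, missing[0])
        elif len(caption.split()) > max_words:
            # Thought: the format constraint is violated; Action: shorten.
            thought = f"caption exceeds {max_words} words"
            action = "shorten"
            caption = shorten(caption, max_words)
        else:
            break  # all constraints satisfied
        # Observation: the caption returned by the tool.
        trace.append(Step(thought, action, caption))
    return caption, trace


if __name__ == "__main__":
    instruction = evolve_instruction(
        "Describe the image.", "a dog show", "max 12 words")
    print(instruction)
    final, steps = run_agent(
        "A dog runs across a sunny park", ["golden retriever"], 12)
    for step in steps:
        print(step)
    print(final)
```

Recording every step in `trace` mirrors the transparency the summary emphasizes: each tool call and its result can be shown to the user. The `max_steps` bound matters because the two tools can undo each other (truncation may cut off a keyword that was just appended).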

Abstract

The Controllable Image Captioning Agent (CapAgent) is an innovative system designed to bridge the gap between simple user instructions and professional-level outputs in image captioning tasks. CapAgent automatically transforms user-provided simple instructions into detailed, professional instructions, enabling precise and context-aware caption generation. By leveraging multimodal large language models (MLLMs) and external tools such as an object detection tool and search engines, the system ensures that captions adhere to specified guidelines, including sentiment, keywords, focus, and formatting. CapAgent transparently controls each step of the captioning process and shows its reasoning and tool usage at every step, fostering user trust and engagement. The project code is available at https://github.com/xin-ran-w/CapAgent.

Paper Structure

This paper contains 19 sections, 2 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: When users need to describe an image, they may have to write more complex and professional instructions to achieve the effect they want. However, writing such long and complex instructions is not easy. Our CapAgent can automatically make the user instructions more professional and follow the evolved instructions to generate more professional captions.
  • Figure 2: Process of instruction evolution with context information extracted from the webpage by using Google Lens and Google Search.
  • Figure 3: The diagram of CapAgent's workflow.
  • Figure 4: Professional instruction and caption examples of CapAgent (Page 1).
  • Figure 5: Professional instruction and caption examples of CapAgent (Page 2).