Panoptic Captioning: An Equivalence Bridge for Image and Text

Kun-Yu Lin, Hongjun Wang, Weining Ren, Kai Han

TL;DR

This work introduces panoptic captioning, which seeks the minimum text description covering all entities in an image, their locations and attributes, the relations among them, and the global image state. It formalizes five semantic dimensions, proposes PancapScore for comprehensive evaluation, and builds the SA-Pancap benchmark alongside PancapEngine, a data engine for generating high-quality data. To tackle the task, PancapChain decouples caption generation into four stages (Loc, Tag, Disc, Cap), enabling stepwise grounding and description. Empirical results show that PancapChain-13B surpasses several strong MLLMs on PancapScore and improves image-text retrieval. This underscores the practical utility of panoptic captions, although a gap remains before text can serve as a true minimum equivalent of an image.

Abstract

This work introduces panoptic captioning, a novel task striving to seek the minimum text equivalent of images, which has broad potential applications. We take the first step towards panoptic captioning by formulating it as a task of generating a comprehensive textual description for an image, which encapsulates all entities, their respective locations and attributes, relationships among entities, as well as global image state. Through an extensive evaluation, our work reveals that state-of-the-art Multi-modal Large Language Models (MLLMs) have limited performance in solving panoptic captioning. To address this, we propose an effective data engine named PancapEngine to produce high-quality data and a novel method named PancapChain to improve panoptic captioning. Specifically, our PancapEngine first detects diverse categories of entities in images by an elaborate detection suite, and then generates required panoptic captions using entity-aware prompts. Additionally, our PancapChain explicitly decouples the challenging panoptic captioning task into multiple stages and generates panoptic captions step by step. More importantly, we contribute a comprehensive metric named PancapScore and a human-curated test set for reliable model evaluation. Experiments show that our PancapChain-13B model can beat state-of-the-art open-source MLLMs like InternVL-2.5-78B and even surpass proprietary models like GPT-4o and Gemini-2.0-Pro, demonstrating the effectiveness of our data engine and method. Project page: https://visual-ai.github.io/pancap/
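The task formulation and the four-stage decomposition can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Entity`, `PanopticCaption`, `run_chain`, and the `toy_*` stage functions are hypothetical, the `(x1, y1, x2, y2)` box format is an assumption, and each stage is a hard-coded stub where the actual method would call a model. It only shows how a caption might hold the listed content (entities with tags, locations and attributes, relations, global state) and how the stages named Loc, Tag, Disc, and Cap could be chained.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # assumed (x1, y1, x2, y2) pixel box format


@dataclass
class Entity:
    tag: str                                   # semantic tag, e.g. "dog"
    box: Box                                   # location in the image
    attributes: List[str] = field(default_factory=list)


@dataclass
class PanopticCaption:
    entities: List[Entity]
    relations: List[Tuple[int, str, int]]      # (subject idx, predicate, object idx)
    global_state: str                          # e.g. scene type or lighting

    def to_text(self) -> str:
        # Flatten the structured content into one textual description.
        parts = [self.global_state]
        for e in self.entities:
            desc = " ".join(e.attributes + [e.tag])
            parts.append(f"{desc} at {list(e.box)}")
        for s, pred, o in self.relations:
            parts.append(f"{self.entities[s].tag} {pred} {self.entities[o].tag}")
        return ". ".join(parts) + "."


def run_chain(image, localize, assign_tags, discover, describe) -> PanopticCaption:
    """Run four hypothetical stage callables in order: Loc, Tag, Disc, Cap."""
    boxes = localize(image)                          # Loc: entity instance localization
    entities = assign_tags(image, boxes)             # Tag: semantic tag assignment
    entities = entities + discover(image, entities)  # Disc: extra instance discovery
    return describe(image, entities)                 # Cap: panoptic caption generation


# Toy stand-ins for the stages (hard-coded, for illustration only).
def toy_localize(image):
    return [(10, 20, 110, 220)]


def toy_assign_tags(image, boxes):
    return [Entity("dog", b) for b in boxes]


def toy_discover(image, entities):
    return [Entity("ball", (120, 180, 160, 220), ["red"])]


def toy_describe(image, entities):
    entities[0].attributes.append("brown")
    return PanopticCaption(entities, [(0, "next to", 1)], "An outdoor scene")


caption = run_chain(None, toy_localize, toy_assign_tags, toy_discover, toy_describe)
print(caption.to_text())
# An outdoor scene. brown dog at [10, 20, 110, 220]. red ball at [120, 180, 160, 220]. dog next to ball.
```

Passing the stages as callables keeps the ordering explicit and lets each stage be swapped out independently, which mirrors the step-by-step generation described above without assuming anything about how each stage is actually modeled.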

Paper Structure

This paper contains 25 sections, 2 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Top Left: An example to demonstrate our proposed panoptic captioning task, which is formulated as generating a comprehensive textual description encapsulating all entities, their respective locations and attributes, relationships among entities, as well as global image state for a given image. Top Right: We report several models' performance on four distinct dimensions on the validation and test sets of our SA-Pancap benchmark. The figure shows that three state-of-the-art Multi-modal Large Language Models (MLLMs) struggle with panoptic captioning, while our proposed PancapChain performs generally better with a significantly smaller model size. Bottom: Image "reconstruction" by the text-to-image model PixArt-$\Sigma$ [DBLP:conf/eccv/ChenGXWYRWLLL24] with different types of captions. Best viewed in color.
  • Figure 2: An overview of our proposed PancapScore metric. PancapScore first extracts semantic content from captions, and then evaluates model performance by entity instance matching and instance-aware question answering (QA).
  • Figure 3: An overview of our proposed PancapChain method. PancapChain explicitly decouples the challenging panoptic captioning task into four stages, namely entity instance localization, semantic tag assignment, extra instance discovery and panoptic caption generation.
  • Figure 4: Image reconstruction using PixArt-$\Sigma$ [DBLP:conf/eccv/ChenGXWYRWLLL24]. Compared with baseline models, our PancapChain can better capture image details, and thus leads to better image reconstruction.
  • Figure A1: Image reconstruction using PixArt-$\Sigma$ [DBLP:conf/eccv/ChenGXWYRWLLL24]. Compared with previous models, our PancapChain can better capture image details, and thus leads to better image reconstruction. Best viewed in color.
  • ...and 2 more figures