Segment and Caption Anything

Xiaoke Huang, Jianfeng Wang, Yansong Tang, Zheng Zhang, Han Hu, Jiwen Lu, Lijuan Wang, Zicheng Liu

TL;DR

A method to efficiently equip the Segment Anything Model (SAM) with the ability to generate regional captions by introducing a lightweight query-based feature mixer that aligns the region-specific features with the embedding space of language models for later caption generation.

Abstract

We propose a method to efficiently equip the Segment Anything Model (SAM) with the ability to generate regional captions. SAM generalizes strongly to segmenting anything but falls short in semantic understanding. By introducing a lightweight query-based feature mixer, we align the region-specific features with the embedding space of language models for later caption generation. As the number of trainable parameters is small (typically on the order of tens of millions), training requires less computation, less memory, and less communication bandwidth, which makes it both fast and scalable. To address the scarcity of regional caption data, we propose to first pre-train our model on object detection and segmentation tasks. We call this step weak supervision pretraining since the pre-training data only contains category names instead of full-sentence descriptions. Weak supervision pretraining allows us to leverage many publicly available object detection and segmentation datasets. We conduct extensive experiments to demonstrate the superiority of our method and validate each design choice. This work serves as a stepping stone towards scaling up regional captioning data and sheds light on efficient ways to augment SAM with regional semantics. The project page, along with the associated code, can be accessed via https://xk-huang.github.io/segment-caption-anything/.
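The weak supervision idea can be illustrated with a small sketch: detection annotations carry only a box and a category, so the category name is used as the text target in place of a full caption. This is a minimal illustration and not the authors' data pipeline; `RegionSample`, `detection_to_weak_samples`, and the toy annotations are hypothetical names and values introduced here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed layout


@dataclass
class RegionSample:
    """One training example: a box prompt and the text the model should emit."""
    image_id: str
    box: Box
    text: str


def detection_to_weak_samples(
    image_id: str,
    annotations: List[Tuple[Box, int]],
    category_names: Dict[int, str],
) -> List[RegionSample]:
    """Turn (box, category_id) pairs into caption-style samples.

    The category name stands in for the caption, so the result has the same
    shape as a regional-caption sample but with a much shorter text target.
    """
    samples = []
    for box, category_id in annotations:
        samples.append(RegionSample(image_id, box, category_names[category_id]))
    return samples


# Toy, made-up annotations purely for illustration.
category_names = {0: "dog", 1: "traffic light"}
annotations = [
    ((10.0, 20.0, 110.0, 220.0), 0),
    ((300.0, 5.0, 330.0, 80.0), 1),
]
weak_samples = detection_to_weak_samples("img_0001", annotations, category_names)
for sample in weak_samples:
    print(sample.box, "->", sample.text)
```

Because the weak samples share the format of caption samples, the same training loop could in principle consume both, first the category-name data and then the full-sentence data.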

Paper Structure

This paper contains 9 sections, 3 equations, 3 figures, and 7 tables.

Figures (3)

  • Figure 1: SCA (b) is a lightweight augmentation of SAM (a) with the ability to generate regional captions. On top of the SAM architecture, we add a frozen pre-trained language model and a lightweight hybrid feature mixer. Despite the small number of trainable parameters, the region-specific features learn to align with the embedding space of the language model for regional caption generation.
  • Figure 2: The model architecture. The model consists of three parts: an image encoder, a feature mixer, and decoder heads for masks or text. The key ingredient of the model is the text feature mixer, which is a lightweight bidirectional transformer (Vaswani et al.). We stack it over the one from SAM and reuse its tokens. By solely optimizing the additional mixer, we align the region-specific features with the embedding space of language models. The training is both fast and scalable thanks to the limited number of optimizable parameters.
  • Figure 3: The qualitative results. SCA simultaneously predicts masks (in red contour) and captions. From top to bottom, the captions are from: SAM+Captioner {GIT-large, BLIP-large, BLIP2-OPT-2.7B} (Wang et al.), GRIT (Wu et al.), SCA {GPT2-large+VG, LLAMA-3B+VG, GPT2-large+Pretrain+VG}, and the ground truth. The bounding boxes (in red) are used to prompt the models. Zoom in for a better view.
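The query-based mixing described in Figure 2 can be made concrete with a heavily simplified sketch: a few query tokens attend over image features, and the result is projected to the width of a language model's embedding space. The sketch assumes a single attention head, no learned query/key/value projections, one attention direction only, and plain Python lists instead of tensors; the function names and all numbers are hypothetical and do not reproduce the paper's mixer.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    # Subtract the maximum for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: List[Vector], features: List[Vector]) -> List[Vector]:
    """Scaled dot-product attention: each query becomes a weighted
    average of the image features (features act as keys and values)."""
    dim = len(features[0])
    scale = math.sqrt(dim)
    mixed = []
    for q in queries:
        weights = softmax([dot(q, f) / scale for f in features])
        mixed.append(
            [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
        )
    return mixed


def project(tokens: List[Vector], weight: List[Vector]) -> List[Vector]:
    """Linear map into the (assumed) language-model embedding width.
    `weight` has shape [out_dim][in_dim]."""
    return [[dot(row, t) for row in weight] for t in tokens]


# Toy values: 2 query tokens, 3 image feature vectors, feature width 2, LM width 3.
queries = [[1.0, 0.0], [0.0, 1.0]]
features = [[1.0, 2.0], [3.0, 0.5], [0.0, 1.0]]
to_lm_space = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

mixed = cross_attend(queries, features)
lm_tokens = project(mixed, to_lm_space)
print(len(lm_tokens), "tokens of width", len(lm_tokens[0]))
```

In a trainable version, the query tokens and the projection would be the optimized parameters while the image encoder and language model stay frozen, which is consistent with the small trainable-parameter budget the captions describe.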