CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models

Kiet A. Nguyen; Adheesh Juvekar; Tianjiao Yu; Muntasir Wahed; Ismini Lourentzou

TL;DR

CALICO is the first LVLM designed for multi-image part-level reasoning segmentation. It has two key components: a novel Correspondence Extraction Module that identifies semantic part-level correspondences, and Correspondence Adaptation Modules that embed this information into the LVLM to facilitate multi-image understanding in a parameter-efficient manner.

Abstract

Recent advances in Large Vision-Language Models (LVLMs) have enabled general-purpose vision tasks through visual instruction tuning. While existing LVLMs can generate segmentation masks from text prompts for single images, they struggle with segmentation-grounded reasoning across images, especially at finer granularities such as object parts. In this paper, we introduce the new task of part-focused semantic co-segmentation, which involves identifying and segmenting common objects, as well as common and unique object parts, across images. To address this task, we present CALICO, the first LVLM designed for multi-image part-level reasoning segmentation. CALICO features two key components: a novel Correspondence Extraction Module that identifies semantic part-level correspondences, and Correspondence Adaptation Modules that embed this information into the LVLM to facilitate multi-image understanding in a parameter-efficient manner. To support training and evaluation, we curate MixedParts, a large-scale multi-image segmentation dataset containing $\sim$2.4M samples across $\sim$44K images spanning diverse object and part categories. Experimental results demonstrate that CALICO, with just 0.3% of its parameters finetuned, achieves strong performance on this challenging task.
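
The distinction between common and unique parts in the task description can be illustrated at the label level. The following is a minimal sketch for a single image pair, assuming each image has already been reduced to a set of part names; the function name and the example labels are hypothetical, and the sketch ignores the segmentation masks that the actual task also requires.

```python
from typing import Dict, List, Set


def compare_part_labels(parts_a: Set[str], parts_b: Set[str]) -> Dict[str, List[str]]:
    """Split the part labels of two images into shared and image-specific groups.

    Label-level illustration only: the real task also predicts a mask per part.
    """
    common = parts_a & parts_b  # parts present in both images
    return {
        "common": sorted(common),
        "unique_a": sorted(parts_a - common),  # parts only in the first image
        "unique_b": sorted(parts_b - common),  # parts only in the second image
    }


# Hypothetical part labels for two chair images.
result = compare_part_labels({"seat", "leg", "back"}, {"seat", "leg", "armrest"})
print(result)
```

Here the shared parts are `leg` and `seat`, while `back` and `armrest` are each unique to one image.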

Paper Structure (30 sections, 10 equations, 14 figures, 10 tables)

This paper contains 30 sections, 10 equations, 14 figures, 10 tables.

Introduction
Related Work
Prompting Image Segmentation
Part Segmentation
Object/Part Co-Segmentation
Method
Problem Definition
Calico Architecture
Correspondence Extraction Module (CEM)
Correspondence Adaptation Module (CAM)
Training Objective
MixedParts Dataset
Experiments
Experimental Results
Ablations
...and 15 more sections

Figures (14)

Figure 1: Multi-Image Part-focused Object Comparison with Calico. Our pixel-grounded Large Vision-Language Model, Calico, performs part-focused semantic co-segmentation, a newly introduced task where the goal is to identify, segment, and label common objects, as well as common and unique object parts across multiple images.
Figure 2: Calico Efficiency. Calico improves performance (e.g., recall) in part-focused co-segmentation while reducing TFLOPS by $\sim$32-35% and accelerating inference by $\sim$30-51% compared to SotA baselines, using 8-18$\times$ fewer image tokens.
Figure 3: Overview of the Calico Architecture for Part-Focused Semantic Co-Segmentation. Calico employs a Q-Former cross-attention module to query efficient image embeddings from a pretrained image encoder, which are passed as visual tokens into a Vicuna-based LLM. We extract [SEG] tokens from the output text, which are used to prompt a SAM decoder to produce corresponding segmentation masks. We propose two modules: the Correspondence Extraction Module (CEM), which captures semantic-rich part correspondences, and Correspondence Adaptation Modules (CAMs), which inject this information into the LVLM. CEM/CAM details are shown in Figure 4.
Figure 4: Overview of our Correspondence Extraction and Correspondence Adaptation Modules. In Calico, $k$ CAMs are distributed through the $N$-layer LLM, one every $\frac{N}{k}$ layers.
Figure 5: Example Image Pairs in MixedParts with Common Objects, Common Parts, and Unique Parts segmented and labeled. Each column represents a different image pair, drawn from a set of diverse datasets with various levels of detail (PACO-LVIS, PartImageNet, and ADE20K-Part-234) that cover both rigid and non-rigid objects and parts. Each image pair is displayed across 3 rows to illustrate (i) the (possibly common or different) object(s), (ii) the common object part(s), and (iii) the unique object part(s) in each pair.
...and 9 more figures
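
The CAM spacing described in the Figure 4 caption ($k$ modules, one every $\frac{N}{k}$ layers of an $N$-layer LLM) can be sketched as a small index computation. This is an illustration under stated assumptions, not the authors' implementation: the caption does not say whether a CAM sits before or after its block of layers, so the sketch assumes each CAM follows the last layer of a block, and the values $N=32$, $k=4$ are chosen only for the example.

```python
from typing import List


def cam_layer_indices(num_layers: int, num_cams: int) -> List[int]:
    """Return zero-based indices of the LLM layers after which a CAM is placed.

    Assumption: one CAM follows the final layer of each block of N/k layers.
    """
    if num_cams <= 0 or num_layers <= 0:
        raise ValueError("num_layers and num_cams must be positive")
    if num_layers % num_cams != 0:
        raise ValueError("num_layers must be divisible by num_cams")
    stride = num_layers // num_cams  # N / k layers between consecutive CAMs
    return [stride * (i + 1) - 1 for i in range(num_cams)]


# Illustrative values only: a 32-layer LLM with 4 CAMs.
print(cam_layer_indices(32, 4))  # [7, 15, 23, 31]
```

With a different offset convention (for example, a CAM before each block) the indices would shift by a constant, but the spacing of $\frac{N}{k}$ layers stays the same.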
