SCHNet: SAM Marries CLIP for Human Parsing

Kunliang Liu, Jianming Wang, Rize Jin, Wonjun Hwang, Tae-Sun Chung

TL;DR

This work tackles semantic-aware human parsing by integrating CLIP's semantic understanding with SAM's fine-grained segmentation. It introduces a Semantic-Refinement Module (SRM) that injects multi-level CLIP semantics into SAM at every stage, and a Fine-Tuning Module (FTM) that appends learnable tokens and applies a lightweight, shared MLP-based refinement to adapt SAM to human parsing. Together, these modules enable faster convergence and improved accuracy on the Look into Person (LIP), PASCAL-Person-Part (PPP), and CIHP datasets, achieving state-of-the-art performance with notably reduced training time. The approach demonstrates the practical value of combining foundation models for domain-specific dense prediction tasks and offers a general recipe for efficient multi-modal adaptation in segmentation problems.

Abstract

Vision Foundation Models (VFMs) such as the Segment Anything Model (SAM) and the Contrastive Language-Image Pre-training model (CLIP) have shown promising performance on segmentation and detection tasks. However, although SAM excels at fine-grained segmentation, it faces major challenges when applied to semantic-aware segmentation. CLIP, in contrast, exhibits strong semantic understanding by aligning the global features of language and vision, but it is deficient in fine-grained segmentation. Human parsing requires segmenting human bodies into constituent parts, which involves both accurate fine-grained segmentation and a high-level semantic understanding of each part. Based on the traits of SAM and CLIP, we formulate highly efficient modules that integrate their features to benefit human parsing. We propose a Semantic-Refinement Module that integrates the semantic features of CLIP with SAM features. Moreover, we formulate a highly efficient Fine-tuning Module to adapt the pretrained SAM to human parsing, which needs rich semantic information and spatial detail at the same time; this significantly reduces training time compared with full-time training and achieves notable performance. Extensive experiments on the LIP, PPP, and CIHP datasets demonstrate the effectiveness of our method.

Paper Structure

This paper contains 13 sections, 6 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Visual comparison among SAM, CLIP, and our SCHNet on human parsing. The regions that highlight the differences are marked with white circles. SAM produces fine-grained segmentation without missing regions, but its outputs tend to be noisy, while CLIP provides coarse predictions and sometimes fails to parse entire important parts. Our method integrates the strengths of both, ensuring stable and reliable performance.
  • Figure 2: Architecture of SCHNet. $f_{txt}$: the text feature output by the pre-trained text encoder of CLIP. $f_{cls}$: the class embedding feature output by the pre-trained image encoder of CLIP. $f_{cv}^{\{1 \ldots 4\}}$: the feature maps output by the pre-trained image encoder of CLIP; we use the feature maps of all blocks (from $1$ to $4$) of the CLIP image encoder. $f_{sim}$: the similarity feature computed from the text feature and the class embedding feature by SimModule. $\mathrm{Layer}_{i}$ and $\mathrm{Layer}_{i+1}$: the $i$-th and $(i+1)$-th layers of the SAM network. $f_{i}$: the output feature maps of the $i$-th layer of SAM. $f_{i}^{\prime\prime}$: the feature maps fine-tuned by FTM. $f_{sv}^{\{0 \ldots 4\}}$: the output feature maps after the patch embedding and the $4$ blocks of SAM. $f_{sv}^{\prime\prime\,\{0 \ldots 4\}}$: the semantically strengthened feature maps after SRM. We combine the semantic information from each stage of CLIP with the feature maps from each stage of SAM to improve the semantic-aware segmentation performance of the pre-trained SAM fine-tuned by FTM.
  • Figure 3: Overview of SimModule and SRM. (a) SimModule structure; (b) SRM structure. $f_{cls}^{\{0 \ldots 511\}}$: the class embedding feature output by the CLIP image encoder, where $\{0 \ldots 511\}$ is the index range of its dimension. $f_{txt}^{\{0 \ldots 511,\,0 \ldots 19\}}$: the text feature output by the CLIP text encoder, where $\{0 \ldots 511,\,0 \ldots 19\}$ is the index range of its dimensions. We use the LIP dataset as an example; LIP has $20$ categories of human parts. $\bigotimes$: matrix multiplication; $\bigoplus$: element-wise addition; $\bigodot$: element-wise multiplication; $\delta$: Softmax activation. $f_{sv}^{i-1}$: the feature maps from the $(i-1)$-th stage of SAM. $f_{cv}^{i}$: the feature maps from the $i$-th stage of CLIP. $f_{sim}$: the similarity between the text feature and the class embedding feature. $\uparrow$ and $\downarrow$: increase and decrease of the channel dimension by the indicated multiple of the input dimension.
  • Figure 4: Overview of FTM. (a) The module that adds learnable token information. (b) The module that fine-tunes the feature maps of SAM. $f_{i}$, $f_{i}^{\prime}$, $f_{i}^{\prime\prime}$: the feature maps from the $i$-th layer of SAM, the feature maps with learnable token information added, and the fine-tuned feature maps that are input to the next layer of SAM, respectively. $T_{i}^{T}$: the transposed $i$-th learnable tokens; $m \times c$ is the token dimension. $\bigotimes$, $\bigoplus$, $\delta$: matrix multiplication, element-wise addition, and Softmax activation, respectively. $\rho$ is a learnable parameter. $*$ denotes multiplication by a coefficient.
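
The similarity computation described for SimModule in Figure 3(a) can be illustrated with a minimal sketch. The caption specifies only a 512-dimensional class embedding, a 512 × 20 text feature (for the 20 LIP categories), matrix multiplication, and Softmax. The L2 normalization, the `scale` parameter, and the function names below are assumptions made for illustration; this is one plausible reading of the caption, not the authors' implementation.

```python
import math
import random


def softmax(xs):
    """Numerically stable Softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalize(v):
    """Scale a vector to unit length (assumption: CLIP-style cosine similarity)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def sim_module(f_cls, f_txt, scale=1.0):
    """Hypothetical SimModule: Softmax(f_cls x f_txt).

    f_cls: list of D floats (class embedding, D = 512 in the caption).
    f_txt: D x K nested list, one column per category (K = 20 for LIP).
    scale: assumed logit scale; not taken from the source.
    Returns a list of K similarity weights that sum to 1.
    """
    d = len(f_cls)
    if len(f_txt) != d:
        raise ValueError("f_txt must have one row per f_cls dimension")
    k = len(f_txt[0])
    cls = l2_normalize(f_cls)
    # Extract and normalize each category column of the text feature.
    cols = [l2_normalize([f_txt[r][c] for r in range(d)]) for c in range(k)]
    # Matrix multiplication: (1 x D) times (D x K) gives K logits.
    logits = [scale * sum(a * b for a, b in zip(cls, col)) for col in cols]
    return softmax(logits)


# Usage with random stand-ins for the CLIP features.
random.seed(0)
D, K = 512, 20
f_cls = [random.gauss(0.0, 1.0) for _ in range(D)]
f_txt = [[random.gauss(0.0, 1.0) for _ in range(K)] for _ in range(D)]
f_sim = sim_module(f_cls, f_txt)
print(len(f_sim))
```

The output `f_sim` has one weight per part category, which matches the role the caption gives it as the similarity between the text feature and the class embedding feature. How SRM then combines `f_sim` with the SAM and CLIP feature maps is not specified in enough detail here to sketch.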