Unsupervised Part Discovery via Dual Representation Alignment

Jiahao Xia, Wenjian Huang, Min Xu, Jianguo Zhang, Haimin Zhang, Ziyu Sheng, Dong Xu

TL;DR

This work tackles unsupervised part discovery with PartFormer, a transformer-based module with K+1 part embeddings, trained within a dual representation alignment framework. The model is trained on paired images generated from the same source image under different geometric transformations, and the part representations of the two views are exchanged. Geometric constraints (concentration and area) and semantic constraints (perceptual and ArcFace-based) guide the learning of part-specific attention, so that the aligned part representations act as dense part detectors. The approach produces semantically consistent part masks without supervision and is evaluated on CelebA-in-the-wild, AFLW, CUB, DeepFashion, and PartImageNet. It benefits from pretraining, and the representation exchange improves invariance to geometric transformations. These results suggest robustness and possible applicability to open-vocabulary or cross-domain part discovery tasks.

Abstract

Object parts serve as crucial intermediate representations in various downstream tasks, but part-level representation learning has still not received as much attention as other vision tasks. Previous research has established that Vision Transformers can learn instance-level attention without labels, extracting high-quality instance-level representations that boost downstream tasks. In this paper, we achieve unsupervised part-specific attention learning using a novel paradigm and further employ the part representations to improve part discovery performance. Specifically, paired images are generated from the same image with different geometric transformations, and multiple part representations are extracted from these paired images using a novel module named PartFormer. The part representations from the paired images are then exchanged to improve invariance to geometric transformations. Subsequently, the part representations are aligned with the feature map extracted by a feature map encoder, achieving high similarity with the pixel representations of the corresponding part regions and low similarity in irrelevant regions. Finally, geometric and semantic constraints are applied to the part representations through the intermediate results of the alignment for part-specific attention learning, encouraging the PartFormer to focus locally and the part representations to explicitly include the information of the corresponding parts. Moreover, the aligned part representations can further serve as a series of reliable detectors in the testing phase, predicting pixel masks for part discovery. Extensive experiments are carried out on four widely used datasets, and our results demonstrate that the proposed method achieves competitive performance and robustness due to its part-specific attention.
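
As a rough illustration of the first step described in the abstract (two views generated from one image by different geometric transformations), the following Python sketch produces a pair of views from a small grid. The flips and 90-degree rotations are simple stand-ins chosen here for illustration; the abstract does not say which transformations the authors use, and the names `hflip`, `rot90`, `random_view` and `make_pair` are hypothetical.

```python
import random


def hflip(img):
    """Mirror a 2D grid (list of rows) left to right."""
    return [row[::-1] for row in img]


def rot90(img):
    """Rotate a 2D grid by 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def random_view(img, rng):
    """Apply a random flip and a random number of quarter turns.

    Assumption: these transformations are illustrative stand-ins only.
    """
    out = [list(row) for row in img]
    if rng.random() < 0.5:
        out = hflip(out)
    for _ in range(rng.randrange(4)):
        out = rot90(out)
    return out


def make_pair(img, seed=0):
    """Return two independently transformed views of the same image."""
    rng = random.Random(seed)
    return random_view(img, rng), random_view(img, rng)


if __name__ == "__main__":
    image = [[1, 2], [3, 4]]
    view_1, view_2 = make_pair(image, seed=0)
    print(view_1)
    print(view_2)
```

In the described framework, part representations extracted from one view would then be paired with the feature map of the other view, which is what pushes them toward invariance to the transformation.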

Paper Structure

This paper contains 18 sections, 10 equations, 15 figures, and 11 tables.

Figures (15)

  • Figure 1: Self-attention maps for each object part and the unsupervised part discovery results. Each attention map is the self-attention averaged over heads and layers, indicating the attention of each part embedding to the image patch embeddings. The red, green, blue and pink areas represent part 1, part 2, part 3 and part 4, respectively.
  • Figure 2: The overall architecture for part-specific attention learning. It takes as inputs paired images ($\bm{I}^{(1)}$ and $\bm{I}^{(2)}$) generated from the same image $\bm{I}$ using random geometric transformations. A feature map encoder extracts dense feature maps $\bm{F}^{\left(n\right)}$, $n=1, 2$; a PartFormer extracts the part representations $\bm{G}^{\left(n\right)}$, $n=1, 2$, in a global manner; and a novel transfer module transfers the part representations into dense feature maps. After the part representations from the paired images are exchanged, the transfer module first calculates a probability map $\bm{V}^{\left(2 \rightarrow 1\right)}$ from $\bm{F}^{\left(1\right)}$ and $\bm{G}^{\left(2\right)}$ using the Hadamard product, and then normalizes it with a scaled softmax function. Finally, $\bm{G}^{\left(2\right)}$ is assigned according to $\bm{V}^{\left(2 \rightarrow 1\right)}$ to form a synthetic feature map $\bm{S}^{\left(2 \rightarrow 1\right)}$. Similarly, $\bm{S}^{\left(1 \rightarrow 2\right)}$ is generated from $\bm{F}^{\left(2\right)}$ and $\bm{G}^{\left(1\right)}$. Both $\bm{S}^{\left(1 \rightarrow 2\right)}$ and $\bm{S}^{\left(2 \rightarrow 1\right)}$ are fed into a decoder for reconstruction, forming a closed loop for part-specific attention learning.
  • Figure 3: Overall pipeline for part discovery in the testing phase. The weights of the feature map encoder and PartFormer come from the two-stream architecture.
  • Figure 4: Visualized part discovery results of our proposed method and other state-of-the-art methods on CelebA-in-the-wild (K=8). Key: [Part 1, Part 2, Part 3, Part 4, Part 5, Part 6, Part 7, Part 8]
  • Figure 5: Part discovery results and the corresponding attention maps for the discovered parts on CelebA-in-the-wild (K=4/8, with/without pretraining). Key: [Part 1, Part 2, Part 3, Part 4, Part 5, Part 6, Part 7, Part 8]
  • ...and 10 more figures
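
The transfer module described in the Figure 2 caption can be sketched roughly as follows. This is a minimal pure-Python illustration, not the authors' implementation. It assumes that the Hadamard product between a pixel feature and a part representation is summed over channels to give one score per part, that the scaled softmax runs over the K+1 parts with an assumed scale factor `scale`, and that the synthetic feature at each pixel is the probability-weighted mix of the part representations. The function names `scaled_softmax` and `transfer` are hypothetical.

```python
import math


def scaled_softmax(scores, scale):
    """Softmax over a list of scores after multiplying each by `scale`."""
    scaled = [scale * s for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def transfer(feature_map, parts, scale=10.0):
    """Build a probability map V and a synthetic feature map S.

    feature_map: list of pixel vectors (each of length C) from one view.
    parts: list of K+1 part vectors (each of length C) from the other view.
    The value of `scale` is an assumption, not a value from the paper.
    """
    channels = len(parts[0])
    prob_map, synthetic = [], []
    for pixel in feature_map:
        # Assumption: element-wise product summed over channels per part.
        scores = [sum(f * g for f, g in zip(pixel, part)) for part in parts]
        probs = scaled_softmax(scores, scale)
        prob_map.append(probs)
        # Assign part representations to this pixel, weighted by V.
        synthetic.append([
            sum(probs[k] * parts[k][c] for k in range(len(parts)))
            for c in range(channels)
        ])
    return prob_map, synthetic


if __name__ == "__main__":
    features_view_1 = [[1.0, 0.0], [0.0, 1.0]]  # two pixels, C = 2
    parts_view_2 = [[1.0, 0.0], [0.0, 1.0]]     # two part vectors
    v_2_to_1, s_2_to_1 = transfer(features_view_1, parts_view_2)
    print(v_2_to_1)
    print(s_2_to_1)
```

Calling `transfer` a second time with the feature map of the second view and the parts of the first view gives the counterpart of $\bm{S}^{\left(1 \rightarrow 2\right)}$. A larger `scale` makes each pixel's assignment closer to a hard choice of a single part.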