LoDisc: Learning Global-Local Discriminative Features for Self-Supervised Fine-Grained Visual Recognition

Jialu Shi, Zhiqiang Wei, Jie Nie, Lei Huang

TL;DR

LoDisc addresses fine-grained visual recognition (FGVR) in a purely self-supervised regime by coupling a global contrastive branch with a local discrimination branch that focuses on pivotal local regions. The local branch selects pivotal patches from attention weights through a location-wise mask sampling strategy, uses them as local views, and trains with a dedicated contrastive objective, while the global branch follows MoCo v3. The overall objective is $L = L_g + L_l$, the sum of the global and local contrastive losses, so global and local features are optimized jointly. Experiments on FGVC-Aircraft, Stanford Cars, CUB-200-2011, and Caltech-101 report state-of-the-art performance in both classification and retrieval, with substantial gains over baselines and good generalization to general object recognition.
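As a rough illustration of the combined objective $L = L_g + L_l$, the sketch below computes an InfoNCE-style contrastive loss twice for one global query: once against global-view keys and once against local-region keys. It is not the authors' implementation. The function names, the temperature value of 0.2, and the use of a single query are assumptions made for illustration, and the symmetrized loss and momentum update used by MoCo v3 are omitted.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(query, keys, positive_index, temperature=0.2):
    """InfoNCE loss for one query against a list of keys.

    keys[positive_index] is the positive; all other keys act as negatives.
    The temperature default is an assumption, not a value from the paper.
    """
    q = l2_normalize(query)
    logits = []
    for key in keys:
        k = l2_normalize(key)
        logits.append(sum(a * b for a, b in zip(q, k)) / temperature)
    # Numerically stable log-sum-exp over all keys.
    largest = max(logits)
    log_denominator = largest + math.log(sum(math.exp(z - largest) for z in logits))
    return log_denominator - logits[positive_index]


def total_loss(query, global_keys, local_keys, positive_index, temperature=0.2):
    """L = L_g + L_l: the same global query is contrasted with global keys
    (global branch) and with local-region keys (local branch)."""
    loss_global = info_nce(query, global_keys, positive_index, temperature)
    loss_local = info_nce(query, local_keys, positive_index, temperature)
    return loss_global + loss_local


if __name__ == "__main__":
    q = [1.0, 0.0]
    global_keys = [[0.9, 0.1], [0.0, 1.0]]   # toy embeddings of global views
    local_keys = [[0.8, 0.2], [-1.0, 0.0]]   # toy embeddings of local regions
    print(total_loss(q, global_keys, local_keys, positive_index=0))
```

In this toy setup the loss falls as the query moves closer to its positive key in either branch, which is the sense in which the two terms are optimized jointly.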

Abstract

The self-supervised contrastive learning strategy has attracted considerable attention due to its exceptional ability in representation learning. However, current contrastive learning tends to learn global coarse-grained representations of the image that benefit generic object recognition, whereas such coarse-grained features are insufficient for fine-grained visual recognition. In this paper, we incorporate subtle local fine-grained feature learning into global self-supervised contrastive learning through a pure self-supervised global-local fine-grained contrastive learning framework. Specifically, a novel pretext task called local discrimination (LoDisc) is proposed to explicitly supervise the self-supervised model's focus toward local pivotal regions, which are captured by a simple but effective location-wise mask sampling strategy. We show that the LoDisc pretext task can effectively enhance fine-grained clues in important local regions and that the global-local framework further refines the fine-grained feature representations of images. Extensive experimental results on different fine-grained object recognition tasks demonstrate that the proposed method can lead to a decent improvement in different evaluation settings. The proposed method is also effective for general object recognition tasks.

Paper Structure (28 sections, 6 equations, 9 figures, 8 tables, 1 algorithm)

Figures (9)

  • Figure 1: Global-local fine-grained visual recognition multi-branch framework. Sample pairs are augmentations of images, and the inputs to the local contrastive branches are generated from the augmentations of these images. The global contrastive branches learn global discriminative features based on InstDisc in a self-supervised vision transformer (MoCo v3), which has an encoder and a momentum encoder. The local contrastive branches learn local discriminative features via LoDisc and share the momentum encoder.
  • Figure 2: Comparison between two pretext tasks. Positive sample pairs originate from the same sample, such as q and k+. Negative sample pairs originate from different samples, such as q and k-. The two pretext tasks differ in their inputs k. (a) q, k+, and k- are global views from the augmentations of images. (b) Only q is the global view, while both k+ and k- are local regions from the augmentations of images. Contrastive loss maximizes the similarity between positive sample pairs (red solid line) and the dissimilarity between negative sample pairs (green dashed line) to learn discriminative features.
  • Figure 3: Overview of the proposed method. In the global branches, MoCo v3 is used to learn global coarse-grained discriminative features. To specifically gain local fine-grained features, a pretext task LoDisc is proposed to supervise local branches in learning local fine-grained discriminative features, and it is organized by a local pivotal region collection and selection module and a local discriminative feature learning module. In the first module, the proposed method collects the attention weights of each layer in the momentum encoder, and a location-wise mask sampling strategy is developed to selectively keep local pivotal regions. The second module shares the momentum encoder and learns local discriminative features in pivotal regions. During global coarse-grained learning and local fine-grained learning, the model is optimized by global contrastive loss and local contrastive loss.
  • Figure 4: Mask sampling strategies determine which local regions are selected, and they influence the quality of the representations extracted by the feature extractor. Each panel corresponds to a different strategy: (a) the original image, (b) a random masking strategy, (c) a grid-wise masking strategy, (d) a border-wise masking strategy, and (e) the proposed location-wise masking strategy. The masking ratio for all masking strategies is 70%.
  • Figure 5: Performance of the border-wise masking strategy in our global-local framework is evaluated on FGVC-Aircraft at different masking ratios.
  • ...and 4 more figures
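
To make the difference between the masking strategies in Figure 4 concrete, the following sketch contrasts a random strategy with an attention-guided one at a 70% masking ratio. It is only an illustration under stated assumptions: the per-patch attention scores are taken as given (how the paper aggregates attention weights across layers of the momentum encoder is not reproduced here), keeping the highest-scoring patches is an assumed reading of "location-wise", and the names `random_keep` and `location_wise_keep` are hypothetical.

```python
import random


def num_kept(num_patches, mask_ratio):
    """Number of patches that survive masking at the given ratio."""
    return num_patches - int(round(num_patches * mask_ratio))


def random_keep(num_patches, mask_ratio, seed=0):
    """Random masking: keep a uniformly sampled subset of patch indices."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_patches), num_kept(num_patches, mask_ratio)))


def location_wise_keep(attention_scores, mask_ratio):
    """Attention-guided masking (assumed form): keep the patches with the
    highest attention scores and mask the rest."""
    num_patches = len(attention_scores)
    ranked = sorted(range(num_patches),
                    key=lambda i: attention_scores[i],
                    reverse=True)
    return sorted(ranked[:num_kept(num_patches, mask_ratio)])


if __name__ == "__main__":
    # Toy scores for 10 patches; a ViT with 14 x 14 patches would have 196.
    scores = [0.1, 0.9, 0.3, 0.8, 0.05, 0.7, 0.2, 0.0, 0.4, 0.6]
    print("random:       ", random_keep(len(scores), 0.7))
    print("location-wise:", location_wise_keep(scores, 0.7))
```

Both functions return the indices of the patches that remain visible, so the kept set can be used directly to gather patch tokens before they are passed to an encoder.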