LoDisc: Learning Global-Local Discriminative Features for Self-Supervised Fine-Grained Visual Recognition
Jialu Shi, Zhiqiang Wei, Jie Nie, Lei Huang
TL;DR
LoDisc addresses fine-grained visual recognition (FGVR) under a purely self-supervised regime by coupling a global contrastive branch with a local discrimination branch that focuses on pivotal local regions. The local branch selects pivotal patches from attention and applies a location-wise mask sampling strategy to create local views, which are trained with a dedicated contrastive objective; the global branch follows MoCo v3. The overall objective is $L = L_g + L_l$, so global and local features are optimized jointly. Experiments on FGVC-Aircraft, Stanford Cars, CUB-200-2011, and Caltech-101 report state-of-the-art performance in both classification and retrieval, with substantial gains over baselines and good generalization to general object recognition.
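The joint objective $L = L_g + L_l$ can be illustrated with a minimal NumPy sketch. It assumes both branches use an InfoNCE-style contrastive loss with in-batch negatives; the summary only states that the local branch has a dedicated contrastive objective, so the exact form of $L_l$, the temperature `tau`, and the function names `info_nce` and `total_loss` are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def info_nce(q, k, tau=0.2):
    """InfoNCE loss for a batch of query/key embeddings.

    q, k: arrays of shape (batch, dim); row i of k is the positive for
    row i of q, and all other rows act as negatives.
    tau is an assumed temperature, not a value taken from the paper.
    """
    # L2-normalize so the dot product is a cosine similarity.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    k = k / np.linalg.norm(k, axis=1, keepdims=True)
    logits = q @ k.T / tau
    # Subtract the row-wise max for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal.
    return float(-np.mean(np.diag(log_prob)))

def total_loss(g_q, g_k, l_q, l_k, tau=0.2):
    """L = L_g + L_l: global loss plus local loss, equally weighted."""
    loss_global = info_nce(g_q, g_k, tau)  # embeddings of two global views
    loss_local = info_nce(l_q, l_k, tau)   # assumed form for the local branch
    return loss_global + loss_local
```

Because the two terms are simply summed, gradients from the global and local branches reach the shared encoder in the same update step, which is what joint optimization amounts to in this sketch.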
Abstract
The self-supervised contrastive learning strategy has attracted considerable attention due to its exceptional ability in representation learning. However, current contrastive learning tends to learn global coarse-grained representations of the image that benefit generic object recognition, whereas such coarse-grained features are insufficient for fine-grained visual recognition. In this paper, we incorporate subtle local fine-grained feature learning into global self-supervised contrastive learning through a pure self-supervised global-local fine-grained contrastive learning framework. Specifically, a novel pretext task called local discrimination (LoDisc) is proposed to explicitly supervise the self-supervised model's focus toward local pivotal regions, which are captured by a simple but effective location-wise mask sampling strategy. We show that the LoDisc pretext task can effectively enhance fine-grained clues in important local regions and that the global-local framework further refines the fine-grained feature representations of images. Extensive experimental results on different fine-grained object recognition tasks demonstrate that the proposed method can lead to a decent improvement in different evaluation settings. The proposed method is also effective for general object recognition tasks.
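As a rough illustration of how a location-wise mask could isolate pivotal regions, the sketch below keeps the patches with the highest attention scores and masks the rest. The top-k selection rule, the `keep_ratio` parameter, the per-patch attention vector as input, and the names `pivotal_mask` and `local_view` are assumptions made for this example; the abstract does not specify these details.

```python
import numpy as np

def pivotal_mask(attn, keep_ratio=0.25):
    """Boolean mask over patch locations; True marks a kept (pivotal) patch.

    attn: 1-D array with one attention score per patch (assumed input).
    keep_ratio: assumed fraction of patches to keep.
    """
    n = attn.shape[0]
    k = max(1, int(round(n * keep_ratio)))
    # Indices of the k highest-scoring patches; stable sort keeps ties ordered.
    top = np.argsort(-attn, kind="stable")[:k]
    mask = np.zeros(n, dtype=bool)
    mask[top] = True
    return mask

def local_view(patches, attn, keep_ratio=0.25):
    """Return the kept patch embeddings and their original locations."""
    mask = pivotal_mask(attn, keep_ratio)
    return patches[mask], np.flatnonzero(mask)

# Usage: 4 patches with 3-dimensional embeddings, keep the top half.
patches = np.arange(12, dtype=float).reshape(4, 3)
attn = np.array([0.10, 0.50, 0.05, 0.35])
kept, locations = local_view(patches, attn, keep_ratio=0.5)
```

Returning the original locations alongside the kept patches lets a downstream encoder retain positional information for the local view.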
