A Contrastive Learning Framework Empowered by Attention-based Feature Adaptation for Street-View Image Classification

Qi You, Yitai Cheng, Zichao Zeng, James Haworth

TL;DR

CLIP-MHAdapter is a variant of the current lightweight CLIP adaptation paradigm. It appends a bottleneck MLP equipped with multi-head self-attention that operates on patch tokens to model inter-patch dependencies, and it achieves superior or competitive accuracy across eight attribute classification tasks on the Global StreetScapes dataset.

Abstract

Street-view image attribute classification is a vital downstream task of image classification, enabling applications such as autonomous driving, urban analytics, and high-definition map construction. It remains computationally demanding whether training from scratch, initialising from pre-trained weights, or fine-tuning large models. Although pre-trained vision-language models such as CLIP offer rich image representations, existing adaptation or fine-tuning methods often rely on their global image embeddings, limiting their ability to capture fine-grained, localised attributes essential in complex, cluttered street scenes. To address this, we propose CLIP-MHAdapter, a variant of the current lightweight CLIP adaptation paradigm that appends a bottleneck MLP equipped with multi-head self-attention operating on patch tokens to model inter-patch dependencies. With approximately 1.4 million trainable parameters, CLIP-MHAdapter achieves superior or competitive accuracy across eight attribute classification tasks on the Global StreetScapes dataset, attaining new state-of-the-art results while maintaining low computational cost. The code is available at https://github.com/SpaceTimeLab/CLIP-MHAdapter.
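
The adapter idea described in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation (see the linked repository for that). The layer order (down-projection, multi-head self-attention, ReLU, up-projection), the residual blend weight `alpha`, the toy dimensions, and all function names below are assumptions made for illustration only.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix standing in for learned weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def self_attention_head(tokens, wq, wk, wv):
    """Scaled dot-product self-attention for a single head."""
    q, k, v = matmul(tokens, wq), matmul(tokens, wk), matmul(tokens, wv)
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(r) for r in scores]  # each row sums to 1 over all patches
    return matmul(weights, v), weights


def init_params(dim, bottleneck, num_heads, seed=0):
    """Hypothetical parameter layout: down-projection, per-head Q/K/V, up-projection."""
    if bottleneck % num_heads != 0:
        raise ValueError("bottleneck must be divisible by num_heads")
    rng = random.Random(seed)
    head_dim = bottleneck // num_heads
    return {
        "down": rand_matrix(dim, bottleneck, rng),
        "heads": [
            tuple(rand_matrix(bottleneck, head_dim, rng) for _ in range(3))
            for _ in range(num_heads)
        ],
        "up": rand_matrix(bottleneck, dim, rng),
    }


def mh_adapter(tokens, params, alpha=0.2):
    """Bottleneck adapter with multi-head self-attention over patch tokens."""
    hidden = matmul(tokens, params["down"])  # project to the bottleneck width
    head_outs = [self_attention_head(hidden, *h)[0] for h in params["heads"]]
    # Concatenate the per-head outputs token by token.
    concat = [sum((ho[i] for ho in head_outs), []) for i in range(len(tokens))]
    act = [[max(0.0, v) for v in row] for row in concat]  # ReLU
    up = matmul(act, params["up"])  # project back to the token width
    # Residual blend with the frozen features (alpha is an assumed hyperparameter).
    return [
        [alpha * u + (1 - alpha) * t for u, t in zip(u_row, t_row)]
        for u_row, t_row in zip(up, tokens)
    ]


# Toy usage: 4 patch tokens of width 8 standing in for frozen CLIP patch features.
rng = random.Random(42)
patch_tokens = rand_matrix(4, 8, rng, scale=1.0)
params = init_params(dim=8, bottleneck=4, num_heads=2)
adapted = mh_adapter(patch_tokens, params)
print(len(adapted), len(adapted[0]))
```

Because each attention row mixes information from every patch, the adapted token at one location can depend on content elsewhere in the scene, which is the inter-patch dependency the abstract refers to. Only the adapter weights would be trained in such a setup; the CLIP backbone stays frozen.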

Paper Structure (27 sections, 17 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: The CLIP-MHAdapter framework. A visual MHAdapter module is integrated downstream of the pretrained image encoder, enabling task-specific adaptation while preserving the representational strengths of the pretrained CLIP backbone (Radford et al., 2021).
  • Figure 2: Details of the Multi-Head Feature Adaptation Module. Note that the input images are partitioned into standard 16 × 16 patches before being fed into the Vision Transformer encoder (Dosovitskiy et al., 2020). A larger patch size is shown here solely for clarity of illustration.
  • Figure 3: Qualitative results of CLIP-MHAdapter. The attention maps from the MHSA layer in the Visual MHAdapter are overlaid on the original input images for visualization.
  • Figure 4: Confusion matrices of CLIP-MHAdapter on the GSS test set across the eight attribute classification tasks. Each matrix corresponds to one attribute.
  • Figure 5: The class distribution for each SVI attribute in the labelled GSS dataset (Hou et al., 2024).
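
As a small worked example of the 16 × 16 patching mentioned in the Figure 2 caption, the snippet below counts how many patch tokens a square image yields. The 224-pixel input size is an assumption (a common CLIP ViT input resolution) and is not stated in the text above; `patch_grid` is a hypothetical helper name.

```python
def patch_grid(image_size: int, patch_size: int = 16) -> tuple:
    """Return (patches per side, total patches) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side, side * side


# Assumed 224 x 224 input: 224 / 16 = 14 patches per side, 14 * 14 = 196 tokens.
side, num_patches = patch_grid(224)
print(side, num_patches)  # 14 196
```

Under that assumption, the self-attention layer in the adapter would operate on a sequence of 196 patch tokens per image.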