A Contrastive Learning Framework Empowered by Attention-based Feature Adaptation for Street-View Image Classification

Qi You; Yitai Cheng; Zichao Zeng; James Haworth

TL;DR

CLIP-MHAdapter is a variant of the lightweight CLIP adaptation paradigm that appends a bottleneck MLP equipped with multi-head self-attention over patch tokens to model inter-patch dependencies. It achieves superior or competitive accuracy across eight attribute classification tasks on the Global StreetScapes dataset.

Abstract

Street-view image attribute classification is a vital downstream task of image classification, enabling applications such as autonomous driving, urban analytics, and high-definition map construction. It remains computationally demanding, whether models are trained from scratch, initialised from pre-trained weights, or obtained by fine-tuning large models. Although pre-trained vision-language models such as CLIP offer rich image representations, existing adaptation or fine-tuning methods often rely on their global image embeddings, which limits their ability to capture the fine-grained, localised attributes that are essential in complex, cluttered street scenes. To address this, we propose CLIP-MHAdapter, a variant of the current lightweight CLIP adaptation paradigm that appends a bottleneck MLP equipped with multi-head self-attention operating on patch tokens to model inter-patch dependencies. With approximately 1.4 million trainable parameters, CLIP-MHAdapter achieves superior or competitive accuracy across eight attribute classification tasks on the Global StreetScapes dataset, attaining new state-of-the-art results while maintaining low computational cost. The code is available at https://github.com/SpaceTimeLab/CLIP-MHAdapter.
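
The adapter idea described in the abstract (multi-head self-attention over patch tokens followed by a bottleneck MLP) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the class name `MHAdapterSketch`, the token width of 512, the 8 heads, the reduction factor of 4, the residual blend weight `alpha`, and the mean pooling are all assumptions chosen for illustration, and the resulting parameter count is not meant to reproduce the reported 1.4 million. The random input stands in for patch tokens from a frozen CLIP image encoder.

```python
import torch
import torch.nn as nn


class MHAdapterSketch(nn.Module):
    """Hypothetical adapter: self-attention over patch tokens, then a bottleneck MLP."""

    def __init__(self, dim=512, num_heads=8, reduction=4, alpha=0.5):
        super().__init__()
        # Multi-head self-attention models dependencies between patches.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Bottleneck MLP: project down, apply a non-linearity, project back up.
        self.bottleneck = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(),
            nn.Linear(dim // reduction, dim),
        )
        # Assumed fixed blend between adapted and original tokens.
        self.alpha = alpha

    def forward(self, patch_tokens):
        # patch_tokens: (batch, num_patches, dim)
        attended, attn_weights = self.attn(patch_tokens, patch_tokens, patch_tokens)
        adapted = self.bottleneck(self.norm(attended))
        mixed = self.alpha * adapted + (1.0 - self.alpha) * patch_tokens
        # Mean-pool the tokens into one image-level feature (an assumption).
        pooled = mixed.mean(dim=1)
        return pooled, attn_weights


torch.manual_seed(0)
adapter = MHAdapterSketch()
tokens = torch.randn(2, 196, 512)  # stand-in for frozen encoder patch tokens
pooled, weights = adapter(tokens)
n_params = sum(p.numel() for p in adapter.parameters() if p.requires_grad)
print(pooled.shape, weights.shape, n_params)
```

In a setup of this kind only the adapter's parameters would be trained while the backbone stays frozen, which is what keeps the trainable parameter count small. The returned attention weights (averaged over heads) are the sort of quantity that could be overlaid on an image as an attention map.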

Paper Structure (27 sections, 17 equations, 5 figures, 4 tables)

Introduction
Related Work
Street-view Image Analysis
Vision Language Models
Adaptation Strategies for CLIP
Methodology
Global-Local Image Encoder
Multi-Head Feature Adaptation
Text Encoder
Imbalance-Aware Weighting
Inverse-Frequency Weighting
Data
Global StreetScapes (GSS)
Data Pre-Processing
Experimental Setup
...and 12 more sections

Figures (5)

Figure 1: The CLIP-MHAdapter framework. A visual MHAdapter module is integrated downstream of the pretrained image encoder, enabling task-specific adaptation while preserving the representational strengths of the pretrained CLIP backbone (Radford et al., 2021).
Figure 2: Details of the Multi-Head Feature Adaptation Module. Note that the input images are partitioned into standard 16 × 16 patches before being fed into the Vision Transformer encoder (Dosovitskiy et al., 2020). A larger patch size is shown here solely for clarity of illustration.
Figure 3: Qualitative results of CLIP-MHAdapter. The attention maps from the MHSA layer in the Visual MHAdapter are overlaid on the original input images for visualization.
Figure 4: Confusion matrices of CLIP-MHAdapter on the GSS test set across eight attribute classification tasks. Each matrix corresponds to one attribute.
Figure 5: The class distribution for each SVI attribute of the labelled GSS dataset (Hou et al., 2024).
