HSVLT: Hierarchical Scale-Aware Vision-Language Transformer for Multi-Label Image Classification

Shuyi Ouyang, Hongyi Wang, Ziwei Niu, Zhenjia Bai, Shiao Xie, Yingying Xu, Ruofeng Tong, Yen-Wei Chen, Lanfen Lin

TL;DR

HSVLT tackles multi‑label image classification by enabling tight cross‑modal alignment and multi‑scale reasoning through a four‑stage hierarchical Vision–Language Transformer. It introduces Interactive Visual–Linguistic Attention (IVLA) to update visual, linguistic and multi‑modal features within each scale and a Cross‑Scale Aggregation (CSA) module to fuse information across scales. The approach achieves state‑of‑the‑art results on VOC2007, COCO and NUS‑WIDE with lower computational cost, and ablation studies validate the contributions of IVLA and CSA. The work advances practical multi‑label recognition by combining strong cross‑modal interactions with scalable, multi‑scale reasoning.

Abstract

The task of multi-label image classification involves recognizing multiple objects within a single image. Considering both valuable semantic information contained in the labels and essential visual features presented in the image, tight visual-linguistic interactions play a vital role in improving classification performance. Moreover, given the potential variance in object size and appearance within a single image, attention to features of different scales can help to discover possible objects in the image. Recently, Transformer-based methods have achieved great success in multi-label image classification by leveraging the advantage of modeling long-range dependencies, but they have several limitations. Firstly, existing methods treat visual feature extraction and cross-modal fusion as separate steps, resulting in insufficient visual-linguistic alignment in the joint semantic space. Additionally, they only extract visual features and perform cross-modal fusion at a single scale, neglecting objects with different characteristics. To address these issues, we propose a Hierarchical Scale-Aware Vision-Language Transformer (HSVLT) with two appealing designs: (1) A hierarchical multi-scale architecture that involves a Cross-Scale Aggregation module, which leverages joint multi-modal features extracted from multiple scales to recognize objects of varying sizes and appearances in images. (2) Interactive Visual-Linguistic Attention, a novel attention mechanism module that tightly integrates cross-modal interaction, enabling the joint updating of visual, linguistic and multi-modal features. We have evaluated our method on three benchmark datasets. The experimental results demonstrate that HSVLT surpasses state-of-the-art methods with lower computational cost.
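To make the idea of label-to-image interaction across several scales concrete, the following is a minimal sketch in plain Python. It is not the paper's IVLA or CSA, whose equations are not reproduced in this section. It uses generic scaled dot-product cross-attention (label embeddings as queries, visual tokens as keys and values), scores each label by a dot product with its own embedding, and averages the per-scale logits as a placeholder for cross-scale aggregation. The function names, the scoring rule, the averaging step and the toy inputs are all assumptions made for illustration.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    # Generic scaled dot-product cross-attention (an assumption, not IVLA).
    # queries: [num_labels][d], keys/values: [num_tokens][d]
    d = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        attended.append([sum(w * v[j] for w, v in zip(weights, values))
                         for j in range(len(values[0]))])
    return attended

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(label_embeddings, tokens_per_scale, threshold=0.5):
    # Hypothetical aggregation: average the per-scale logits (a stand-in for CSA).
    logits = [0.0] * len(label_embeddings)
    for tokens in tokens_per_scale:
        attended = cross_attention(label_embeddings, tokens, tokens)
        for i, (feat, emb) in enumerate(zip(attended, label_embeddings)):
            logits[i] += sum(f * e for f, e in zip(feat, emb)) / len(tokens_per_scale)
    # Multi-label output: one independent sigmoid per label, not a softmax over labels.
    probs = [sigmoid(z) for z in logits]
    return probs, [p > threshold for p in probs]

# Toy data: two labels, two scales (4 tokens and 1 token), feature size 2.
labels = [[1.0, 0.0], [0.0, 1.0]]
scales = [[[3.0, -3.0]] * 4, [[3.0, -3.0]]]
probs, present = predict(labels, scales)
print(present)  # [True, False]
```

The per-label sigmoid is what lets several labels be predicted as present for the same image, which is the defining property of the multi-label setting described above.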

Paper Structure (29 sections, 9 equations, 6 figures, 5 tables)

This paper contains 29 sections, 9 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Comparison of existing Transformer-based architectures ((a) and (b)) for multi-label image classification with our HSVLT (c).
  • Figure 2: An illustration of HSVLT. (a) The multi-scale joint vision-language encoder network. (b) The cross-scale aggregation module for multi-label classification. First, the input image $V_0$ and labels $L_0$ are sent to the Joint Vision-Language Encoder. At the beginning of each stage, the visual features are down-sampled and the channel dimensions of the visual and linguistic features are unified. The $i$-th stage contains $N_i$ interaction blocks. The interaction blocks learn joint visual features $V_i$, linguistic features $L_i$ and multi-modal features $S_i$, $i \in \{1,2,3,4\}$, which contain local visual details and global visual-linguistic cues. $S_1, S_2, S_3, S_4$ are sent to the cross-scale aggregation module (b) for multi-label classification prediction.
  • Figure 3: (a) An illustration of the interaction block in the Joint Vision-Language Encoder. (b) An illustration of the Interactive Visual-Linguistic Attention.
  • Figure 4: Comparison of existing multi-scale integration structures ((a) and (b)) with the structure of our Cross-Scale Aggregation module (c).
  • Figure 5: Comparison of our HSVLT (w/o CSA) and HSVLT with the Transformer-based methods TSFormer [zhu2022two], M3TR [zhao2021m3tr] and C-Tran [lanchantin2021general] in terms of Params, GFLOPs and mAP on the Microsoft COCO test set.
  • ...and 1 more figure
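
Figure 5 reports mAP. For readers unfamiliar with the metric, the following is a minimal sketch of one common way to compute mean average precision for multi-label predictions: rank the images by score for each label, average the precision at every rank where a true positive occurs, then average over labels. The exact evaluation protocol used in the paper may differ (for example, in interpolation or tie handling), and the function names and toy scores are assumptions for illustration.

```python
def average_precision(scores, targets):
    # AP for one label: mean precision at each rank where a positive appears.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if targets[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(score_matrix, target_matrix):
    # Rows are images, columns are labels; mAP is the mean of per-label APs.
    num_labels = len(score_matrix[0])
    aps = []
    for c in range(num_labels):
        col_scores = [row[c] for row in score_matrix]
        col_targets = [row[c] for row in target_matrix]
        aps.append(average_precision(col_scores, col_targets))
    return sum(aps) / num_labels

# Toy example: three images, two labels (hypothetical scores).
scores = [[0.9, 0.2], [0.8, 0.7], [0.1, 0.6]]
targets = [[1, 0], [0, 1], [1, 1]]
print(mean_average_precision(scores, targets))
```

In the toy example the first label has AP 5/6 (one positive is ranked below a negative) and the second has AP 1, so the mAP is 11/12.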