Fine-Grained Scene Image Classification with Modality-Agnostic Adapter

Yiqun Wang, Zhao Zhou, Xiangcheng Du, Xingjiao Wu, Yingbin Zheng, Cheng Jin

TL;DR

The paper tackles fine-grained scene image classification by removing reliance on fixed modality priors and proposing a Modality-Agnostic Adapter (MAA) that equalizes modality distributions before semantic-level fusion with a modality-agnostic Transformer. By leveraging global ViT embeddings, text from KnowBert, and optional local visual cues, MAA learns the relative importance of each modality adaptively and can readily accommodate new modalities. Empirical results on Con-Text and Crowd Activity demonstrate state-of-the-art performance, with further gains when adding local embeddings; ablations confirm the necessity of independent modality alignment and the efficacy of the two-layer Transformer. This approach offers a scalable, flexible framework for multi-modal fusion in fine-grained scene understanding and is accompanied by publicly available code.

Abstract

In fine-grained scene image classification, most previous works place heavy emphasis on global visual features during multi-modal feature fusion. In other words, these models are deliberately designed around prior intuitions about the importance of different modalities. In this paper, we present a new multi-modal feature fusion approach named MAA (Modality-Agnostic Adapter), which aims to let the model learn the importance of different modalities adaptively in different cases, without fixing a prior in the model architecture. More specifically, we eliminate the modal differences in distribution and then use a modality-agnostic Transformer encoder for semantic-level feature fusion. Our experiments demonstrate that MAA achieves state-of-the-art results on benchmarks while using the same modalities as previous methods. Moreover, new modalities can be easily added when using MAA to further boost performance. Code is available at https://github.com/quniLcs/MAA.

Paper Structure

This paper contains 12 sections, 3 equations, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: Different multi-modal feature fusion strategies for fine-grained scene image classification. Global (local) refers to global (local) visual embeddings, and text refers to text embeddings. While previous methods (a, b) place heavy emphasis on global visual features, ours (c) treats all modalities equally.
  • Figure 2: Example images from the Con-Text dataset [karaoglu2013text]. Images in the first and second rows are from the classes Bakery and Cafe, respectively. Images in the left column can be classified based on the text in the image, those in the middle column based on the object in the image, and those in the right column based on both cues.
  • Figure 3: The architecture of MAA. After obtaining the multi-modal feature embeddings (e.g., global image embedding, local region embedding, and text embedding), several multi-layer perceptrons (MLPs) are used to eliminate the modal differences. The resulting modality-agnostic embeddings are then fed into the modality-agnostic Transformer encoder for semantic-level fusion.
  • Figure 4: Examples of model predictions. Images in the first and second columns are from Con-Text [karaoglu2013text], while those in the last column are from Crowd Activity [wang2022knowledge]. Text in green and red indicates correct and incorrect predictions, respectively.
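
The pipeline described in the Figure 3 caption (per-modality MLPs followed by a shared Transformer encoder) can be sketched roughly as follows. This is a minimal PyTorch illustration, not the authors' implementation (see their repository for that). The class name `ModalityAgnosticFusion`, the modality keys, all dimensions, the MLP layout, and the mean pooling before the classifier are assumptions made for the example. Only the overall structure and the two-layer encoder depth come from the text above.

```python
import torch
from torch import nn


class ModalityAgnosticFusion(nn.Module):
    """Illustrative sketch: per-modality MLP adapters + shared Transformer encoder."""

    def __init__(self, input_dims, num_classes, d_model=256, num_layers=2, nhead=4):
        super().__init__()
        # One independent MLP per modality projects its embeddings into a
        # shared d_model-dimensional space (assumed layout: Linear-GELU-Linear-LayerNorm).
        self.adapters = nn.ModuleDict({
            name: nn.Sequential(
                nn.Linear(dim, d_model),
                nn.GELU(),
                nn.Linear(d_model, d_model),
                nn.LayerNorm(d_model),
            )
            for name, dim in input_dims.items()
        })
        # The encoder sees only a flat token sequence: no positional or
        # modality-type embeddings are added, so it has no built-in prior
        # about which modality a token came from.
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead,
            dim_feedforward=4 * d_model, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, features):
        # features: dict mapping modality name -> tensor of shape (batch, tokens, dim)
        tokens = torch.cat(
            [self.adapters[name](x) for name, x in features.items()], dim=1
        )
        fused = self.encoder(tokens)
        # Mean pooling over all tokens is an assumption of this sketch.
        return self.classifier(fused.mean(dim=1))


# Usage with arbitrary, made-up sizes: one global token, five text tokens.
model = ModalityAgnosticFusion(
    {"global_visual": 768, "text": 300}, num_classes=5
)
model.eval()
batch = {
    "global_visual": torch.randn(2, 1, 768),
    "text": torch.randn(2, 5, 300),
}
logits = model(batch)  # shape: (2, 5)
```

In this sketch, adding a modality such as local region embeddings only requires one more entry in `input_dims` and in the input dictionary. The encoder and classifier are unchanged, which mirrors the extensibility claim in the abstract.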