SSPA: Split-and-Synthesize Prompting with Gated Alignments for Multi-Label Image Recognition

Hao Tan, Zichang Tan, Jun Li, Jun Wan, Zhen Lei, Stan Z. Li

TL;DR

SSPA tackles multi-label image recognition (MLR) by drawing on knowledge from large language models (LLMs) through in-context prompting and a Split-and-Synthesize Prompting (SSP) strategy. It introduces Quaternion Semantic Modeling (QSM) to combine generic knowledge, downstream label semantics, and visual context, and Gated Dual-Modal Alignments (GDMA) to align text and region features bidirectionally while filtering out redundant cross-modal signals. A soft aggregator then combines the predictions from all image regions. The approach reports state-of-the-art results on nine datasets spanning natural images, pedestrian attributes, and remote sensing, and further analyses support the effectiveness of the SSP components and the interpretability of the gate mechanisms. The framework integrates linguistic knowledge with region-level visual representations and generalizes across domains.

Abstract

Multi-label image recognition is a fundamental task in computer vision. Recently, Vision-Language Models (VLMs) have made notable advancements in this area. However, previous methods fail to effectively leverage the rich knowledge in language models and often incorporate label semantics into visual features unidirectionally. To overcome these problems, we propose a Split-and-Synthesize Prompting with Gated Alignments (SSPA) framework to amplify the potential of VLMs. Specifically, we develop an in-context learning approach to elicit the inherent knowledge of LLMs. We then propose a novel Split-and-Synthesize Prompting (SSP) strategy that first models the generic knowledge and downstream label semantics individually and then aggregates them carefully through a quaternion network. Moreover, we present Gated Dual-Modal Alignments (GDMA), which enable bidirectional interaction between the visual and linguistic modalities while eliminating redundant cross-modal information, leading to more efficient region-level alignments. Rather than making the final prediction in a sharp manner, as in previous works, we propose a soft aggregator that jointly considers results from all image regions. With the help of flexible prompting and gated alignments, SSPA is generalizable to specific domains. Extensive experiments on nine datasets from three domains (i.e., natural, pedestrian attributes, and remote sensing) demonstrate the state-of-the-art performance of SSPA. Further analyses verify the effectiveness of SSP and the interpretability of GDMA. The code will be made public.
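
To make the contrast between a "sharp" prediction and a soft aggregator concrete, the sketch below compares three ways of pooling per-region scores for a single label. The abstract does not give the aggregator's formula, so the softmax-weighted pooling used here, the `temperature` parameter, and all function names are assumptions for illustration only, not the authors' implementation.

```python
import math


def hard_aggregate(region_scores):
    """Sharp prediction: keep only the best-scoring region."""
    return max(region_scores)


def average_aggregate(region_scores):
    """Uniform pooling: every region contributes equally."""
    return sum(region_scores) / len(region_scores)


def soft_aggregate(region_scores, temperature=1.0):
    """Assumed soft aggregator: softmax-weighted sum of region scores.

    Higher-scoring regions get larger weights, but every region
    still contributes to the final score.
    """
    top = max(region_scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in region_scores]
    total = sum(exps)
    return sum((e / total) * s for e, s in zip(exps, region_scores))


# Hypothetical similarity scores of one label against four image regions.
scores = [0.9, 0.1, 0.2, 0.8]
print(hard_aggregate(scores), average_aggregate(scores), soft_aggregate(scores))
```

Under this assumed form, a small temperature pushes the result toward the hard maximum and a large temperature pushes it toward the plain average, so the soft score always lies between the two.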

Paper Structure (19 sections, 19 equations, 8 figures, 12 tables)

Figures (8)

  • Figure 1: Paradigm comparison. (a) Previous methods [chen2019learning, you2020cross, wang2020multi, zhu2022two, zhu2023scene] adopt pure category names or plain templates to extract text features. Then unidirectional interaction is applied to aggregate label semantics and $C$ classifiers are trained for recognition. (b) Our method employs novel split-and-synthesize prompting, where we extract generic knowledge from LLM and downstream semantics individually and then aggregate them in quaternion space. With the help of gated bidirectional interaction, SSPA can efficiently align the text features and regional features. No extra classifiers are required.
  • Figure 2: Overview of the proposed SSPA framework. The global branch directly compares global visual features with text features. The regional branch performs more fine-grained alignments between regional features and label semantics. We develop a Split-and-Synthesize Prompting (SSP) pipeline to obtain holistic label representations: we concatenate LLM prompts with templates to get knowledge-aware text embeddings, and introduce learnable prompts and a Dynamic Semantic Filtering (DSF) module to get context-aware text embeddings. We then synthesize the two through the Quaternion Semantic Modeling (QSM) module. To let text embeddings and visual features interact mutually while filtering out redundant cross-modal signals, we propose Gated Dual-Modal Alignments (GDMA), which efficiently align regional features with label semantics and yield input-adaptive category centers during inference. The final scores are predicted by our soft aggregator.
  • Figure 3: Our prompts to the LLM. The text prompts the LLM to associate pertinent knowledge about shapes, sizes, colors, and possible label relationships. Through a domain description and in-context examples, the LLM can be seamlessly linked to different domains. We also control the conciseness and structure of the answers to enable automatic processing.
  • Figure 4: The proposed gated visual-to-semantic attention. The output of cross-modal attention is gated by the learned gate vector to filter out redundant signals. Gated semantic-to-visual attention has a symmetric structure.
  • Figure 5: Ablation study (%) on the global-regional framework and the soft aggregator in the regional branch. "G+R" denotes the framework using both global and regional branches. "Hard" denotes using a hard aggregator and "Average" means simply averaging the results from different regions.
  • ...and 3 more figures
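
The gating described in the Figure 4 caption can be illustrated with a minimal sketch. The caption only states that the output of cross-modal attention is gated by a learned gate vector. Everything else here is an assumption: label embeddings act as queries over region features, attention is scaled dot-product, the gate is a per-dimension sigmoid of hypothetical `gate_logits`, and no projections or residual connection are modeled.

```python
import math


def softmax(xs):
    top = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    # Stable for large positive and negative inputs.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_cross_attention(queries, keys_values, gate_logits):
    """Attend from each query to all keys/values, then gate the result.

    queries:      list of d-dim vectors (assumed: label embeddings)
    keys_values:  list of d-dim vectors (assumed: region features)
    gate_logits:  d-dim vector standing in for the learned gate
    """
    d = len(gate_logits)
    scale = 1.0 / math.sqrt(d)
    gate = [sigmoid(g) for g in gate_logits]
    outputs = []
    for q in queries:
        logits = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys_values]
        weights = softmax(logits)
        attended = [sum(w * k[j] for w, k in zip(weights, keys_values)) for j in range(d)]
        # Element-wise gate: values near 0 suppress a feature dimension.
        outputs.append([g * a for g, a in zip(gate, attended)])
    return outputs


# Hypothetical toy inputs: two labels, three regions, d = 2.
label_embeddings = [[1.0, 0.0], [0.0, 1.0]]
region_features = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
half_open = gated_cross_attention(label_embeddings, region_features, [0.0, 0.0])
print(half_open)
```

Swapping the roles of the two inputs (regions as queries, label embeddings as keys and values) gives the symmetric semantic-to-visual direction under the same assumptions.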