$S^3$: Synonymous Semantic Space for Improving Zero-Shot Generalization of Vision-Language Models

Xiaojie Yin, Qilong Wang, Bing Cao, Qinghua Hu

TL;DR

The paper tackles semantic misalignment in vision-language models caused by lexical variation by introducing a Synonymous Semantic Space ($S^3$) for each class. It generates multiple synonymous texts via large language models, builds a compact semantic space by using a Vietoris-Rips complex and 0-dimensional persistent homology to identify the largest connected component, and evaluates several point-to-space metrics, most notably a Point-to-Local-Center metric, for robust zero-shot prediction. A Test-Time Adaptation variant (TS$^3$) adaptively shifts the semantic spaces during inference to further improve alignment. Experiments across 17 benchmarks show that $S^3$ and TS$^3$ achieve state-of-the-art performance on fine-grained zero-shot classification, natural distribution shifts, and open-vocabulary segmentation, with favorable cost-efficiency compared to prior methods.

Abstract

Recently, many studies have been conducted to enhance the zero-shot generalization ability of vision-language models (e.g., CLIP) by addressing the semantic misalignment between image and text embeddings in downstream tasks. Although many efforts have been made, existing methods barely consider the fact that a class of images can be described by notably different textual concepts due to well-known lexical variation in natural language processing, which heavily affects the zero-shot generalization of CLIP. Therefore, this paper proposes a Synonymous Semantic Space ($S^3$) for each image class, rather than relying on a single textual concept, achieving more stable semantic alignment and improving the zero-shot generalization of CLIP. Specifically, our $S^3$ method first generates several synonymous concepts based on the label of each class by using large language models, and constructs a continuous yet compact synonymous semantic space based on the Vietoris-Rips complex of the generated synonymous concepts. Furthermore, we explore the effect of several point-to-space metrics on our $S^3$, while presenting a point-to-local-center metric to compute similarity between image embeddings and the synonymous semantic space of each class, accomplishing effective zero-shot predictions. Extensive experiments are conducted across 17 benchmarks, including fine-grained zero-shot classification, natural distribution zero-shot classification, and open-vocabulary segmentation, and the results show that our $S^3$ outperforms state-of-the-art methods.
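
The space-construction step can be illustrated with a minimal sketch. In 0-dimensional homology, the classes of a Vietoris-Rips complex at scale ε are the connected components of the graph that links any two points at distance at most ε, so the largest component can be found with a union-find pass. The function name `largest_connected_component`, the use of cosine distance, and the fixed threshold `epsilon` are assumptions made for illustration; this summary does not state how the paper selects the scale from the persistence information, so the sketch is not the authors' implementation.

```python
import numpy as np


def largest_connected_component(embeddings: np.ndarray, epsilon: float) -> np.ndarray:
    """Indices of the largest connected component of the Vietoris-Rips
    1-skeleton at scale `epsilon`, using cosine distance (an assumption)."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dist = 1.0 - x @ x.T  # pairwise cosine distance
    n = len(x)
    parent = list(range(n))

    def find(i: int) -> int:
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Two points share an edge (and hence a component) when dist <= epsilon.
    for i in range(n):
        for j in range(i + 1, n):
            if dist[i, j] <= epsilon:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    roots = np.array([find(i) for i in range(n)])
    labels, counts = np.unique(roots, return_counts=True)
    return np.flatnonzero(roots == labels[np.argmax(counts)])


# Toy text embeddings: three near-synonyms and one outlier (hypothetical data).
toy = np.array([[1.0, 0.0], [0.99, 0.1], [0.98, 0.2], [0.0, 1.0]])
kept = largest_connected_component(toy, epsilon=0.05)
print(kept.tolist())  # [0, 1, 2]
```

Keeping only the largest component discards generated texts that drift away from the class concept, which is one way to read the "compact" property the abstract describes.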

Paper Structure

This paper contains 18 sections, 6 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Comparison of Methods. (a) CLIP: Point-to-point similarity between image and label embeddings. (b) PE: Point-to-point similarity between image and single concept embeddings. (c) TTA: Point-to-point similarity between image and shifted text embeddings. (d) $S^3$ (Ours): Similarity between image and semantic spaces constructed from multiple synonymous concepts.
  • Figure 2: (a) Lexical variation in LAION-400M dataset: Images of the same class with very similar visual embeddings correspond to significantly different text embeddings, which may even belong to different textual concepts. (b) Compactness: image vs. text: Image embeddings (blue) are consistently more compact than text embeddings (red). The original data (light color) has been smoothed. (c) Synonymous concepts form semantic spaces: Different synonymous concepts for a class form continuous, non-overlapping spaces.
  • Figure 3: Overall architecture of $S^3$. Given the label of each class, $S^3$ prompts LLMs to generate synonymous texts, which are then used to construct a synonymous semantic space by seeking the largest connected component in the topology of the semantic space. For a test image, the similarities between the image embedding and the synonymous semantic spaces are computed for zero-shot prediction.
  • Figure 4: Generating Synonymous Texts. A class name (e.g., "sunflower") and its dataset name (e.g., "flowers") are given as inputs to the LLMs through two prompts. The first generates synonyms (e.g., "sunflower", "helianthus"), and the second provides descriptors (e.g., "large, daisy-like flower"). These are then combined into synonymous texts (e.g., "A photo of a sunflower, which is a large, daisy-like flower").
  • Figure 5: Point-to-Space Similarity Metric: (a) Point-to-Set. (b) Point-to-Center. (c) Point-to-Subspace. (d) Point-to-Local-Center.
  • ...and 2 more figures
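
The text-generation step in Figure 4 and the metric names in Figure 5 can be made concrete with a short sketch. The template string follows the example in the Figure 4 caption; pairing every synonym with every descriptor is an assumption. The four metric functions are one plausible reading of the names in Figure 5 (mean similarity to the set, similarity to the set mean, norm of the projection onto the span, and similarity to the mean of the `k` nearest text embeddings). The paper's exact definitions may differ, and all function names and the parameter `k` are hypothetical.

```python
import numpy as np


def build_synonymous_texts(synonyms, descriptors):
    """Combine synonyms and descriptors using the template from Figure 4.
    Pairing every synonym with every descriptor is an assumption."""
    return [f"A photo of a {s}, which is a {d}" for s in synonyms for d in descriptors]


def _normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def point_to_set(image, texts):
    # Assumed reading: mean cosine similarity to every text embedding.
    return float(np.mean(_normalize(texts) @ _normalize(image)))


def point_to_center(image, texts):
    # Cosine similarity to the normalized mean of all text embeddings.
    center = _normalize(_normalize(texts).mean(axis=0))
    return float(center @ _normalize(image))


def point_to_subspace(image, texts):
    # Norm of the projection of the unit image vector onto span(texts),
    # computed by least squares so rank-deficient sets are handled.
    t, v = _normalize(texts), _normalize(image)
    coeffs, *_ = np.linalg.lstsq(t.T, v, rcond=None)
    return float(np.linalg.norm(t.T @ coeffs))


def point_to_local_center(image, texts, k=3):
    # Cosine similarity to the mean of the k text embeddings nearest the image.
    t, v = _normalize(texts), _normalize(image)
    nearest = np.argsort(t @ v)[-min(k, len(t)):]
    return float(_normalize(t[nearest].mean(axis=0)) @ v)


def predict(image, class_spaces, metric=point_to_local_center):
    """Zero-shot prediction: index of the class whose space scores highest."""
    return int(np.argmax([metric(image, texts) for texts in class_spaces]))
```

In this reading, Point-to-Local-Center sits between the two extremes: it ignores text embeddings far from the image, unlike Point-to-Center, but still averages over several neighbors rather than relying on a single closest text.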