LGCA: Enhancing Semantic Representation via Progressive Expansion

Thanh Hieu Cao, Trung Khang Tran, Gia Thinh Pham, Tuong Nghiem Diep, Thanh Binh Nguyen

TL;DR

LGCA tackles misinformation and bias arising from random cropping in CLIP-based zero-shot classification by introducing a Localized-Globalized Cross-Alignment framework. It first extracts local crops, then iteratively expands salient regions to integrate global context, and finally aggregates scores across expansion steps to form a robust image-text similarity. The authors provide a time-complexity analysis showing that the expansion steps do not significantly increase cost relative to a non-expanding baseline. Empirically, LGCA consistently outperforms state-of-the-art baselines across five datasets with two CLIP backbones, especially on fine-grained and complex scenes, demonstrating robustness and scalability for cross-modal zero-shot transfer.

Abstract

Recent advancements in large-scale pretraining in natural language processing have enabled pretrained vision-language models such as CLIP to effectively align images and text, significantly improving performance in zero-shot image classification tasks. Subsequent studies have further demonstrated that cropping images into smaller regions and using large language models to generate multiple descriptions for each caption can further enhance model performance. However, due to the inherent sensitivity of CLIP, random image crops can introduce misinformation and bias, as many images share similar features at small scales. To address this issue, we propose Localized-Globalized Cross-Alignment (LGCA), a framework that first captures the local features of an image and then repeatedly selects the most salient regions and expands them. The similarity score is designed to incorporate both the original and expanded images, enabling the model to capture both local and global features while minimizing misinformation. Additionally, we provide a theoretical analysis demonstrating that the time complexity of LGCA remains the same as that of the original model prior to the repeated expansion process, highlighting its efficiency and scalability. Extensive experiments demonstrate that our method substantially improves zero-shot performance across diverse datasets, outperforming state-of-the-art baselines.
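The select-then-expand loop described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the names `expand_box`, `progressive_scores`, and `aggregate`, the center-based expansion by a fixed `factor`, the top-`keep` selection rule, and the plain averaging across steps are all hypothetical choices. `score_fn` stands in for whatever image-text similarity the model provides for a cropped region.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def expand_box(box: Box, width: float, height: float, factor: float = 1.5) -> Box:
    """Grow a box about its center by `factor`, clipped to the image bounds."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    half_w = (x1 - x0) * factor / 2.0
    half_h = (y1 - y0) * factor / 2.0
    return (
        max(0.0, cx - half_w),
        max(0.0, cy - half_h),
        min(width, cx + half_w),
        min(height, cy + half_h),
    )


def progressive_scores(
    boxes: Sequence[Box],
    score_fn: Callable[[Box], float],
    width: float,
    height: float,
    steps: int = 2,
    keep: int = 2,
    factor: float = 1.5,
) -> List[float]:
    """Return one mean score per step (assumes `boxes` is non-empty).

    Step 0 scores the initial local crops. Each later step keeps the
    `keep` highest-scoring boxes (an assumed notion of "most salient")
    and scores their expanded versions.
    """
    per_step: List[float] = []
    current = list(boxes)
    for _ in range(steps + 1):
        scored = sorted(((score_fn(b), b) for b in current),
                        key=lambda sb: sb[0], reverse=True)
        per_step.append(sum(s for s, _ in scored) / len(scored))
        current = [expand_box(b, width, height, factor) for _, b in scored[:keep]]
    return per_step


def aggregate(per_step: Sequence[float]) -> float:
    """Combine local and expanded scores; a plain mean is an assumption."""
    return sum(per_step) / len(per_step)
```

In practice `score_fn` would crop the image to the box, embed it with the vision encoder, and compare it with the text embeddings; a toy function is enough to exercise the control flow.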

Paper Structure

This paper contains 16 sections, 1 theorem, 16 equations, 5 figures, 1 table.

Key Result

Theorem 1

Consider a non-expanding model $\mathbf{Q}$. Let $I, L \in \mathbb{R}_{>0}$ be positive real numbers such that the time complexity of $\mathbf{Q}(\mathcal{I}, \mathcal{L})$ is given by $\mathcal{O}(H\times N^I \times M^L)$, for any image and caption datasets $\mathcal{I}$ and $\mathcal{L}$ where $N$

Figures (5)

  • Figure 1: An illustrative case in which random cropping introduces misleading similarity. Consider an image of a Caspian Tern paired with a caption of a swan. The LLM-generated description for the swan includes the phrase “has an orange beak.” Due to random cropping, the model captures only the beak region of the Caspian Tern, which also appears orange. This results in a high similarity score of 0.72, thereby distorting the overall similarity assessment.
  • Figure 2: Local images are generated through cropping, with each crop weighted by its cosine similarity to the original image, indicating its level of correlation. In the illustration, the middle shows the original image, the left depicts low-correlation crops, and the right shows high-correlation crops.
  • Figure 3: Visualization of similarity scores of text descriptions to the prompt “A photo of Pineapple.” Longer green lines indicate higher relevance, while shorter red lines mark low or incorrect matches. Descriptions that are irrelevant or incorrect are highlighted in red for clarity.
  • Figure 4: Left: Visualization of an Expansion Step. Right: Visualization of our general pipeline with $T$ Expansion Steps.
  • Figure 5: Visualization of images through Expansion Steps. The rows use one image each from the CUB_200_2011, Places365, and DTD datasets, respectively. The leftmost column shows the original image, the second column shows the cropped region, and the subsequent columns show the progressively expanded regions.
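The crop weighting described in the Figure 2 caption, where each crop is weighted by its cosine similarity to the original image, might look like the sketch below. The caption does not say how similarities become weights, so the softmax normalization and the `temperature` parameter are assumptions, as are the function names `cosine`, `crop_weights`, and `weighted_similarity`. Embeddings are plain lists of floats here; in a real pipeline they would come from the CLIP image and text encoders.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def crop_weights(
    image_emb: Sequence[float],
    crop_embs: Sequence[Sequence[float]],
    temperature: float = 1.0,
) -> List[float]:
    """Weight each crop by its similarity to the full image.

    Softmax normalization is an assumption made for this sketch.
    """
    sims = [cosine(image_emb, c) for c in crop_embs]
    top = max(sims)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_similarity(
    image_emb: Sequence[float],
    crop_embs: Sequence[Sequence[float]],
    text_emb: Sequence[float],
) -> float:
    """Crop-to-text similarity, averaged with the image-correlation weights."""
    weights = crop_weights(image_emb, crop_embs)
    return sum(w * cosine(c, text_emb) for w, c in zip(weights, crop_embs))
```

With this weighting, a crop that looks unlike the full image (such as the isolated beak in Figure 1) contributes less to the final score than a crop that is representative of the whole image.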

Theorems & Definitions (2)

  • Theorem 1
  • Proof