LGCA: Enhancing Semantic Representation via Progressive Expansion

Thanh Hieu Cao; Trung Khang Tran; Gia Thinh Pham; Tuong Nghiem Diep; Thanh Binh Nguyen

TL;DR

LGCA tackles misinformation and bias arising from random cropping in CLIP-based zero-shot classification by introducing a Localized-Globalized Cross-Alignment framework. It first extracts local crops, then iteratively expands salient regions to integrate global context, and finally aggregates scores across expansion steps to form a robust image-text similarity. The authors provide a time-complexity analysis showing that the expansion steps do not significantly increase cost relative to a non-expanding baseline. Empirically, LGCA consistently outperforms state-of-the-art baselines across five datasets with two CLIP backbones, especially on fine-grained and complex scenes, demonstrating robustness and scalability for cross-modal zero-shot transfer.

Abstract

Recent advancements in large-scale pretraining in natural language processing have enabled pretrained vision-language models such as CLIP to effectively align images and text, significantly improving performance in zero-shot image classification tasks. Subsequent studies have shown that cropping images into smaller regions and using large language models to generate multiple descriptions for each caption can enhance model performance further. However, due to the inherent sensitivity of CLIP, random image crops can introduce misinformation and bias, as many images share similar features at small scales. To address this issue, we propose Localized-Globalized Cross-Alignment (LGCA), a framework that first captures the local features of an image and then repeatedly selects the most salient regions and expands them. The similarity score is designed to incorporate both the original and expanded images, enabling the model to capture both local and global features while minimizing misinformation. Additionally, we provide a theoretical analysis demonstrating that the time complexity of LGCA remains the same as that of the original model prior to the repeated expansion process, highlighting its efficiency and scalability. Extensive experiments demonstrate that our method substantially improves zero-shot performance across diverse datasets, outperforming state-of-the-art baselines.
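The pipeline described above (local crops, repeated selection and expansion of salient regions, score aggregation across steps) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the saliency criterion, the expansion rule, or the aggregation weights, so the choices below are assumptions made for illustration only. Saliency is taken as the maximum text similarity of a crop, boxes grow about their centre by a fixed `factor`, and per-step class scores are averaged uniformly. All function names, the parameters `n_crops`, `top_k`, `steps`, and `factor`, and the `embed_fn` callable (a stand-in for an image encoder such as CLIP's) are hypothetical.

```python
import numpy as np

def random_boxes(h, w, n, rng, min_frac=0.2, max_frac=0.5):
    """Sample n random crop boxes (x0, y0, x1, y1) inside an h-by-w image."""
    boxes = []
    for _ in range(n):
        frac = rng.uniform(min_frac, max_frac)
        bh, bw = max(1, int(h * frac)), max(1, int(w * frac))
        y0 = int(rng.integers(0, h - bh + 1))
        x0 = int(rng.integers(0, w - bw + 1))
        boxes.append((x0, y0, x0 + bw, y0 + bh))
    return boxes

def expand_box(box, factor, h, w):
    """Grow a box about its centre by `factor` (>= 1), clipped to the image."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    half_w, half_h = (x1 - x0) * factor / 2, (y1 - y0) * factor / 2
    nx0 = max(0, int(np.floor(cx - half_w)))
    ny0 = max(0, int(np.floor(cy - half_h)))
    nx1 = min(w, int(np.ceil(cx + half_w)))
    ny1 = min(h, int(np.ceil(cy + half_h)))
    return (nx0, ny0, nx1, ny1)

def score_boxes(image, boxes, text_embs, embed_fn):
    """Cosine similarity of every crop embedding to every text embedding."""
    embs = np.stack([embed_fn(image[y0:y1, x0:x1]) for x0, y0, x1, y1 in boxes])
    embs = embs / (np.linalg.norm(embs, axis=1, keepdims=True) + 1e-8)
    txt = text_embs / (np.linalg.norm(text_embs, axis=1, keepdims=True) + 1e-8)
    return embs @ txt.T  # shape: (n_boxes, n_classes)

def progressive_expansion_scores(image, text_embs, embed_fn,
                                 n_crops=16, top_k=4, steps=3,
                                 factor=1.5, seed=0):
    """Return one aggregated score per class (illustrative scheme only)."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]

    # Step 0: local view from random crops.
    boxes = random_boxes(h, w, n_crops, rng)
    scores = score_boxes(image, boxes, text_embs, embed_fn)
    per_step = [scores.mean(axis=0)]

    for _ in range(steps):
        # Assumed saliency: a crop's best similarity to any class text.
        saliency = scores.max(axis=1)
        # Keep at most top_k boxes; after the first step all survivors are kept.
        keep = np.argsort(saliency)[::-1][:top_k]
        # Expand the kept boxes so they take in more global context.
        boxes = [expand_box(boxes[i], factor, h, w) for i in keep]
        scores = score_boxes(image, boxes, text_embs, embed_fn)
        per_step.append(scores.mean(axis=0))

    # Assumed aggregation: uniform average over the local and expanded steps.
    return np.mean(per_step, axis=0)
```

In this sketch, only the initial step scores all `n_crops` regions; each expansion step re-scores at most `top_k` boxes, which is one way the added cost could stay small relative to the initial crop scoring. To try it, pass an `H x W x C` array as `image`, a `(n_classes, d)` array as `text_embs`, and any callable that maps a crop to a `d`-dimensional vector as `embed_fn`; the predicted class is the `argmax` of the returned vector.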
