LOBG: Less Overfitting for Better Generalization in Vision-Language Model

Chenhao Ding, Xinyuan Gao, Songlin Dong, Yuhang He, Qiang Wang, Alex Kot, Yihong Gong

TL;DR

LOBG is a framework for vision-language models that uses CLIP to filter out fine-grained foreground information that might cause overfitting. It also introduces a structural topology preservation (STP) loss at the feature level, which gives the feature space overall plasticity so that it can be reshaped effectively during optimization.

Abstract

Existing prompt learning methods for Vision-Language Models (VLMs) have effectively enhanced the transfer of VLMs to downstream tasks, but they suffer from a significant decline in generalization due to severe overfitting. To address this issue, we propose a framework named LOBG for vision-language models. Specifically, we use CLIP to filter out fine-grained foreground information that might cause overfitting, thereby guiding prompts with basic visual concepts. To further mitigate overfitting, we develop a structural topology preservation (STP) loss at the feature level, which endows the feature space with overall plasticity, allowing effective reshaping of the feature space during optimization. Additionally, we employ hierarchical logit distillation (HLD) at the output level to constrain outputs, complementing STP at the output end. Extensive experimental results demonstrate that our method significantly improves generalization capability and alleviates overfitting compared to state-of-the-art approaches.
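The abstract does not give the STP formula, so the following is only a minimal sketch of one way a structure-preserving feature loss could look, not the authors' definition. It assumes the "structure" is the matrix of pairwise cosine similarities among a batch of features, and it penalizes the squared difference between the frozen model's matrix and the prompted model's matrix. The names `structure_loss` and `pointwise_loss` and the toy 2-D features are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(features):
    """Pairwise cosine similarities within one batch of features."""
    return [[cosine(u, v) for v in features] for u in features]

def structure_loss(frozen_feats, tuned_feats):
    """Assumed stand-in for a structure-preserving loss: mean squared
    difference between the two pairwise-similarity matrices."""
    s_frozen = similarity_matrix(frozen_feats)
    s_tuned = similarity_matrix(tuned_feats)
    n = len(frozen_feats)
    total = sum((s_frozen[i][j] - s_tuned[i][j]) ** 2
                for i in range(n) for j in range(n))
    return total / (n * n)

def pointwise_loss(frozen_feats, tuned_feats):
    """Baseline for contrast: ties every feature to its frozen counterpart."""
    gaps = [1.0 - cosine(u, v) for u, v in zip(frozen_feats, tuned_feats)]
    return sum(gaps) / len(gaps)

# Toy features standing in for frozen-CLIP embeddings (hypothetical values).
frozen = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# The whole space rotated by 90 degrees: relations between samples unchanged.
rotated = [[-y, x] for x, y in frozen]
# Features squeezed toward one direction: relations between samples destroyed.
collapsed = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.05]]

print(structure_loss(frozen, rotated))    # close to zero
print(pointwise_loss(frozen, rotated))    # large
print(structure_loss(frozen, collapsed))  # clearly above zero
```

Under these assumptions, a global rotation of the feature space costs nothing under the structure loss but is heavily penalized by the pointwise baseline, which is one way to read the abstract's claim of "overall plasticity": the space may move as a whole as long as the relations between samples are kept.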

Paper Structure

This paper contains 14 sections, 15 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: (a) Attention maps of CLIP and CoOp. Overfitting leads to more attention on fine details of base classes (e.g., a cat's face). The overfitted model (e.g., CoOp) struggles to transfer learned knowledge to unseen classes. (b) Performance comparison of existing methods and ours on base and novel classes.
  • Figure 2: An overview of our method. Subfigure (a) shows the training pipeline of our method, where the text prompts and vision prompts are learnable. Subfigure (b) presents the FIF process of our method. Through FIF, the high-attention areas of the image are filtered out. Subfigure (c) presents the alignment strategy of our method. With STP and HLD, we preserve CLIP’s original generalization ability by maintaining its topological structure without constraining the prompt’s adaptability to downstream tasks.
  • Figure 3: The effect of mask thresholds and sensitivity analysis of $\gamma$ and $\lambda$.
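The captions for Figures 2(b) and 3 mention filtering out high-attention image areas and a mask threshold. A minimal sketch of that idea follows, assuming the image is already split into patches and that a per-patch attention score from the frozen model is available. The function `filter_foreground`, the min-max normalization, and the example values are illustrative assumptions, not the paper's procedure.

```python
def filter_foreground(patches, attention, threshold):
    """Zero out patches whose normalized attention exceeds `threshold`.

    patches:   list of patches, each a list of pixel values
    attention: one attention score per patch (assumed given)
    threshold: cut-off in [0, 1] applied after min-max normalization
    Returns the filtered patches and the binary keep-mask.
    """
    lo, hi = min(attention), max(attention)
    span = (hi - lo) or 1.0  # avoid division by zero for a flat map
    normalized = [(a - lo) / span for a in attention]
    # 1 keeps a patch (low attention), 0 masks it (high attention).
    mask = [1 if a <= threshold else 0 for a in normalized]
    filtered = [[v * m for v in patch] for patch, m in zip(patches, mask)]
    return filtered, mask

# Four toy patches with hypothetical attention scores.
patches = [[0.2, 0.4], [0.9, 0.8], [0.1, 0.3], [0.7, 0.6]]
attention = [0.1, 0.9, 0.2, 0.5]

filtered, mask = filter_foreground(patches, attention, threshold=0.6)
print(mask)      # [1, 0, 1, 1]
print(filtered)  # the second patch is zeroed, the others are unchanged

_, strict_mask = filter_foreground(patches, attention, threshold=0.3)
print(strict_mask)  # [1, 0, 1, 0]
```

A lower threshold removes more of the image, which is presumably the trade-off that the mask-threshold study in Figure 3 examines.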