Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization

Jinlong Li, Dong Zhao, Zequn Jie, Elisa Ricci, Lin Ma, Nicu Sebe

TL;DR

OrthSR is an orthogonal fine-tuning method that efficiently adapts pretrained weights for enhanced robustness and generalization, combined with a self-regularization strategy that maintains the stability of the zero-shot generalization of VLMs.

Abstract

Efficient fine-tuning of vision-language models (VLMs) like CLIP for specific downstream tasks is gaining significant attention. Previous works primarily focus on prompt learning to adapt CLIP to a variety of downstream tasks, but they suffer from task overfitting when fine-tuned on a small dataset. In this paper, we introduce OrthSR, an orthogonal fine-tuning method that efficiently fine-tunes pretrained weights for enhanced robustness and generalization, together with a self-regularization strategy that maintains the stability of the zero-shot generalization of VLMs. Specifically, trainable orthogonal matrices are injected seamlessly into the transformer architecture and kept orthogonal by a constraint enforced during training, while the pretrained weights stay frozen. The norm-preserving property of these matrices leads to stable and faster convergence. To alleviate deviation caused by fine-tuning, a self-regularization strategy is further employed in a bypass manner to retain the generalization of the model during training. In addition, to enrich the sample diversity for downstream tasks in the small-dataset scenario, we first explore attentive CutOut data augmentation to boost efficient fine-tuning, leading to better model fitting capacity for the specific downstream task. We then provide a theoretical analysis of how our approach improves downstream performance and maintains generalizability. For the first time, we revisit CLIP and CoOp with our method, effectively improving them in the few-shot image classification scenario to be on par with elaborate prompt learning methods.
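The orthogonality constraint mentioned in the abstract can be illustrated with the Cayley parameterization, which the Figure 2 caption names as one possible choice. The following is a minimal sketch in pure Python, not the authors' implementation: the helper names (`matmul`, `transpose`, `inverse`, `cayley`), the toy 3x2 weight, and the left-multiplication placement `R @ W` are all assumptions made for illustration. It builds a skew-symmetric matrix $S$ from unconstrained parameters and maps it to $R = (I + S)(I - S)^{-1}$, which is orthogonal by construction.

```python
from math import sqrt
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def inverse(a: Matrix) -> Matrix:
    """Gauss-Jordan inverse with partial pivoting (small matrices only)."""
    n = len(a)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def cayley(params: Matrix) -> Matrix:
    """Map an unconstrained square matrix to an orthogonal matrix.

    S = (A - A^T) / 2 is skew-symmetric, so I - S is always invertible
    and R = (I + S)(I - S)^{-1} satisfies R^T R = I.
    """
    n = len(params)
    s = [[0.5 * (params[i][j] - params[j][i]) for j in range(n)]
         for i in range(n)]
    i_plus = [[(1.0 if i == j else 0.0) + s[i][j] for j in range(n)]
              for i in range(n)]
    i_minus = [[(1.0 if i == j else 0.0) - s[i][j] for j in range(n)]
               for i in range(n)]
    return matmul(i_plus, inverse(i_minus))


def column_norms(w: Matrix) -> List[float]:
    return [sqrt(sum(v * v for v in col)) for col in zip(*w)]


# Hypothetical trainable parameters and a toy frozen weight (3 x 2).
trainable = [[0.0, 0.3, -0.2], [0.1, 0.0, 0.5], [0.4, -0.1, 0.0]]
frozen_w = [[1.0, 2.0], [0.5, -1.0], [-2.0, 0.25]]

rotation = cayley(trainable)
tuned_w = matmul(rotation, frozen_w)  # assumed placement: R applied on the left

# Orthogonal maps preserve Euclidean norms, so these two lists agree
# up to floating-point error.
norms_before = column_norms(frozen_w)
norms_after = column_norms(tuned_w)
```

With all parameters set to zero, `cayley` returns the identity, so under this sketch fine-tuning would start exactly from the pretrained weights, and only the rotation is learned while `frozen_w` itself is never updated.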

Paper Structure (20 sections, 1 theorem, 33 equations, 3 figures, 7 tables)

Key Result

Theorem 1

Assume that $\Theta^{*}$ is the solution to Eq. (eqn_optim). Then we have that for any $0<\epsilon<1$ with probability $1-\epsilon$, where $X^{*}=\max_{r\in \mathbb{N}_{N}}\left| \mathcal{L}\left( \hat{s}_{r}^{S}\left( \Theta \right), y_{r}^{gt} \right) \right|$ and $\alpha > 0$.

Figures (3)

  • Figure 1: Comparison of pipelines for tuning or adapting VLMs to downstream tasks. Our contribution is a new fine-tuning pipeline based on orthogonal tuning that boosts CLIP and CoOp to competitive base/novel accuracy compared with existing methods (results are averaged over 11 datasets).
  • Figure 2: Overview of our proposed pipeline, OrthSR. Top: the fine-tuning pipeline, which applies orthogonal tuning to the feed-forward networks of both the image and text encoders ($\mathcal{F}_v$ and $\mathcal{F}_t$) of the CLIP model, trained with the self-regularization strategy. Bottom left: orthogonal matrix injection, in which an orthogonal matrix is injected into the pretrained weights under an orthogonalization constraint (such as Cayley parameterization). Bottom right: the pretrained CLIP is used to highlight the most discriminative image regions, and a cutout operation is then applied to obtain the cutout image $X_{cutout}$, which is fed to the fine-tuned model together with the original $X$.
  • Figure 3: Ablations in terms of $\lambda_1$ and $\lambda_2$.
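The attentive CutOut step described in the Figure 2 caption (highlight the most discriminative regions, then cut them out) can be sketched as follows. This is an illustrative sketch only: the function name `attentive_cutout`, the patch-grid representation, the `top_k` and `fill` parameters, and the toy attention scores are assumptions. How the pretrained CLIP produces the attention map, and how many regions are removed, are not specified in the text above.

```python
from typing import List

Grid = List[List[float]]


def attentive_cutout(image: Grid, attention: Grid, patch: int,
                     top_k: int = 1, fill: float = 0.0) -> Grid:
    """Mask the `top_k` most-attended patches of a single-channel image.

    `image` is an H x W grid of pixel values; `attention` holds one score
    per patch, so it has shape (H / patch) x (W / patch). The input image
    is left untouched and a masked copy is returned.
    """
    rows, cols = len(attention), len(attention[0])
    if len(image) != rows * patch or len(image[0]) != cols * patch:
        raise ValueError("attention grid does not match image size")

    # Rank patches from most to least attended.
    ranked = sorted(((attention[r][c], r, c)
                     for r in range(rows) for c in range(cols)),
                    reverse=True)

    out = [list(row) for row in image]
    for _, r, c in ranked[:top_k]:
        for y in range(r * patch, (r + 1) * patch):
            for x in range(c * patch, (c + 1) * patch):
                out[y][x] = fill
    return out


# Toy 4 x 4 image split into 2 x 2 patches; the scores are made up and
# stand in for whatever saliency the pretrained model would provide.
x = [[1.0] * 4 for _ in range(4)]
scores = [[0.10, 0.70],
          [0.15, 0.05]]
x_cutout = attentive_cutout(x, scores, patch=2)
# Both `x` and `x_cutout` would then be passed to the fine-tuned model.
```

In this toy case the top-right patch has the highest score, so that 2x2 block is replaced by the fill value while the rest of the image is unchanged.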

Theorems & Definitions (2)

  • Theorem 1
  • proof