Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models

Yabin Zhang, Wenjie Zhu, Hui Tang, Zhiyuan Ma, Kaiyang Zhou, Lei Zhang

TL;DR

This paper proposes dual memory networks comprising dynamic and static memory components, which enhance model performance in the few-shot setting and enable model usability in the absence of training data.

Abstract

With the emergence of pre-trained vision-language models like CLIP, how to adapt them to various downstream classification tasks has garnered significant attention in recent research. The adaptation strategies can typically be categorized into three paradigms: zero-shot adaptation, few-shot adaptation, and the recently proposed training-free few-shot adaptation. Most existing approaches are tailored for a specific setting and can only cater to one or two of these paradigms. In this paper, we introduce a versatile adaptation approach that can effectively work under all three settings. Specifically, we propose dual memory networks that comprise dynamic and static memory components. The static memory caches training data knowledge, enabling training-free few-shot adaptation, while the dynamic memory preserves historical test features online during the testing process, allowing for the exploration of additional data insights beyond the training set. This novel capability enhances model performance in the few-shot setting and enables model usability in the absence of training data. The two memory networks employ the same flexible memory interactive strategy, which can operate in a training-free mode and can be further enhanced by incorporating learnable projection layers. Our approach is tested across 11 datasets under the three task settings. Remarkably, in the zero-shot scenario, it outperforms existing methods by over 3% and even shows superior results against methods utilizing external training data. Additionally, our method exhibits robust performance against natural distribution shifts. Code is available at https://github.com/YBZh/DMN.
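
To make the division of labor between the two memories concrete, the following is a minimal training-free sketch in Python with NumPy. It is an illustration under stated assumptions, not the implementation in the linked repository: the class name `DualMemorySketch`, the helper `memory_readout`, the exponential affinity `exp(-beta * (1 - similarity))` (of the kind used in cache-based adapters), the weights `alpha` and `beta`, and the rule of storing every test feature with its predicted label are all hypothetical choices made for the example. The abstract's learnable projection layers are omitted.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    x = np.asarray(x, dtype=float)
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def memory_readout(query, keys, values, beta):
    """Similarity-weighted sum of cached label vectors (hypothetical read-out)."""
    if keys.shape[0] == 0:
        # An empty memory contributes nothing to the logits.
        return np.zeros(values.shape[1])
    sims = keys @ query                      # cosine similarity to each cached feature
    weights = np.exp(-beta * (1.0 - sims))   # assumed affinity; sharper for larger beta
    return weights @ values


class DualMemorySketch:
    """Text classifier plus a static (training) and a dynamic (test-time) memory."""

    def __init__(self, text_features, train_features=None, train_labels=None,
                 alpha=1.0, beta=5.5):
        self.text = l2_normalize(text_features)          # shape (classes, dim)
        num_classes, dim = self.text.shape
        self.alpha, self.beta = alpha, beta
        self.one_hot = np.eye(num_classes)
        if train_features is None:
            # Zero-shot setting: no training data, so the static memory is empty.
            self.static_keys = np.zeros((0, dim))
            self.static_values = np.zeros((0, num_classes))
        else:
            # Few-shot setting: cache training features with one-hot labels.
            self.static_keys = l2_normalize(train_features)
            self.static_values = self.one_hot[np.asarray(train_labels, dtype=int)]
        # The dynamic memory starts empty and grows during testing.
        self.dynamic_keys = np.zeros((0, dim))
        self.dynamic_values = np.zeros((0, num_classes))

    def predict(self, image_feature):
        q = l2_normalize(image_feature)
        logits = self.text @ q
        logits = logits + self.alpha * memory_readout(
            q, self.static_keys, self.static_values, self.beta)
        logits = logits + self.alpha * memory_readout(
            q, self.dynamic_keys, self.dynamic_values, self.beta)
        pred = int(np.argmax(logits))
        # Assumed update rule: store the test feature with its pseudo-label.
        self.dynamic_keys = np.vstack([self.dynamic_keys, q[None, :]])
        self.dynamic_values = np.vstack([self.dynamic_values, self.one_hot[pred][None, :]])
        return pred


# Toy usage: three classes whose text features are the coordinate axes.
text = np.eye(3)
x = np.array([0.6, 0.59, 0.0])
zero_shot = DualMemorySketch(text)
few_shot = DualMemorySketch(text, train_features=[[1.0, 1.0, 0.0]], train_labels=[1])
print(zero_shot.predict(x), few_shot.predict(x))
```

In this toy setup the same class covers both regimes: with no training features it runs from the text classifier and the growing dynamic memory alone, and with a handful of cached training features the static memory adds a second vote without any gradient updates.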

Paper Structure (14 sections, 12 equations, 11 figures, 6 tables)

This paper contains 14 sections, 12 equations, 11 figures, and 6 tables.

Figures (11)

  • Figure 1: Illustration of the classification accuracy, (test-time) training GFLOPs, and learning parameters on zero-shot and 16-shot ImageNet classification. The icon sizes denote the number of learnable parameters. Our method is unique in its ability to work for all three task settings with superior results.
  • Figure 2: An illustration of the overall framework of our Dual Memory Networks (DMN), which integrates knowledge from three sources (i.e., text input, historical test data, and optional training images) to tackle the three types of adaptation tasks (i.e., zero-shot, few-shot, and the recently-proposed training-free few-shot adaptations).
  • Figure 3: Training-free few-shot results with a ResNet50 backbone. Full results on 11 classification datasets are presented in a separate figure (label Fig:full_tf_res50).
  • Figure 4: Few-shot performance with a ViT-B/16 backbone. Full results on 11 classification datasets are presented in a separate figure (label Fig:full_tr_vit).
  • Figure 5: Few-shot performance with a ResNet50 backbone. Full results on 11 classification datasets are presented in a separate figure (label Fig:full_tr_res50).
  • ...and 6 more figures