Compositional Kronecker Context Optimization for Vision-Language Models

Kun Ding, Xiaohui Li, Qiang Yu, Ying Wang, Haojian Zhang, Shiming Xiang

TL;DR

This work addresses the generalization limitations of prompt-tuning in vision-language models by introducing Compositional Kronecker Context Optimization (CK-CoOp). CK-CoOp constructs context prompts from a compressed base dictionary of token embeddings and augments them with a Kronecker-structured bias, enabling expressive yet compact representations. Across base-to-new, domain, and cross-task evaluations on diverse datasets and backbones, CK-CoOp achieves state-of-the-art or competitive performance while substantially reducing parameters and training/inference time compared with prior methods such as CoOp, CoCoOp, and ProGrad. The approach yields practical benefits for rapid, scalable adaptation of VLMs in real-world tasks, with ablations confirming the value of the compositional structure and Kronecker bias.

Abstract

Context Optimization (CoOp) has emerged as a simple yet effective technique for adapting CLIP-like vision-language models to downstream image recognition tasks. Nevertheless, learning a compact context that adapts to new tasks while retaining satisfactory base-to-new, domain, and cross-task generalization remains a challenge. To tackle this challenge, we propose a lightweight yet generalizable approach termed Compositional Kronecker Context Optimization (CK-CoOp). Technically, the context words of the prompt in CK-CoOp are learnable vectors crafted by linearly combining base vectors sourced from a dictionary. These base vectors consist of a non-learnable component, obtained by quantizing the weights of the token embedding layer, and a learnable component, constructed by applying the Kronecker product to several tiny learnable matrices. Intuitively, the compositional structure mitigates the risk of overfitting to the training data by retaining more pre-trained knowledge. Meanwhile, the Kronecker product relaxes the non-learnable restriction of the dictionary, thereby enhancing representation ability with minimal additional parameters. Extensive experiments confirm that CK-CoOp not only achieves state-of-the-art performance under base-to-new, domain, and cross-task generalization evaluation, but also requires fewer learnable parameters and offers efficient training and inference.
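
The construction described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the k-means quantizer, the toy sizes, the two-factor Kronecker shapes, and the names `quantize_embeddings`, `build_context`, and `count_learnable` are all assumptions made for illustration. It shows only how a frozen dictionary, a Kronecker-structured bias, and a small coefficient matrix might combine into context vectors.

```python
import numpy as np

def quantize_embeddings(E, num_atoms, iters=5, seed=0):
    """Stand-in quantizer (assumption): plain k-means over token-embedding rows.

    E is a (vocab, d) embedding matrix; returns a frozen (num_atoms, d) dictionary.
    """
    rng = np.random.default_rng(seed)
    centers = E[rng.choice(E.shape[0], size=num_atoms, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every embedding to every center: (vocab, num_atoms).
        d2 = ((E[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)
        for k in range(num_atoms):
            members = E[assign == k]
            if len(members):  # leave empty clusters where they are
                centers[k] = members.mean(axis=0)
    return centers

def build_context(dictionary, kron_factors, coeffs):
    """Context = coeffs @ (frozen dictionary + Kronecker-structured learnable bias)."""
    bias = kron_factors[0]
    for factor in kron_factors[1:]:
        bias = np.kron(bias, factor)
    if bias.shape != dictionary.shape:
        raise ValueError("Kronecker factors must multiply out to the dictionary shape")
    base = dictionary + bias   # base vectors: non-learnable part + learnable part
    return coeffs @ base       # each context vector is a linear combination of base vectors

def count_learnable(kron_factors, coeffs):
    """Learnable parameters: the Kronecker factors and the combination coefficients."""
    return sum(f.size for f in kron_factors) + coeffs.size

# Toy sizes (hypothetical): vocab 200, embedding dim 32, 8 dictionary atoms, 4 context words.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(200, 32))
dictionary = quantize_embeddings(token_embeddings, num_atoms=8)

# Two tiny factors whose Kronecker product has shape (2*4, 4*8) = (8, 32).
factors = [0.01 * rng.normal(size=(2, 4)), 0.01 * rng.normal(size=(4, 8))]
coeffs = rng.normal(size=(4, 8))

context = build_context(dictionary, factors, coeffs)
print(context.shape)                     # (4, 32)
print(count_learnable(factors, coeffs))  # 72, versus 4 * 32 = 128 for a free context
```

In this toy setting the bias costs 8 + 32 = 40 parameters instead of the 256 a dense 8 x 32 bias would need, which is the sense in which the Kronecker structure adds capacity with few extra parameters.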

Paper Structure

This paper contains 13 sections, 8 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Framework of the proposed CK-CoOp. Unlike CoOp, CK-CoOp imposes a compositional constraint on the context. The left part inside the dashed box shows how the constrained context proposed in this work is generated; removing this part recovers the original CoOp.
  • Figure 2: (a) Effect of $\alpha$ on base-to-new, domain and cross-task generalization performance. (b) Sorted $L_2$ norm of columns of bias matrix in logarithmic scale. (c) Visualization of various matrices in CK-CoOp. In (a), 'DG:S', 'DG:T', 'DG:mean' denote domain generalization performance on source domain, domain generalization performance on target domain, and overall domain generalization performance. 'CT:S', 'CT:T', 'CT:mean' denote cross-task generalization performance on source task, cross-task generalization performance on target task, and overall cross-task generalization performance.
  • Figure 3: Statistics of max cosine similarities.