Compositional Kronecker Context Optimization for Vision-Language Models

Kun Ding; Xiaohui Li; Qiang Yu; Ying Wang; Haojian Zhang; Shiming Xiang

TL;DR

This work addresses the generalization limitations of prompt-tuning in vision-language models by introducing Compositional Kronecker Context Optimization (CK-CoOp). CK-CoOp constructs context prompts from a compressed base dictionary of token embeddings and augments them with a Kronecker-structured bias, enabling expressive yet compact representations. Across base-to-new, domain, and cross-task evaluations on diverse datasets and backbones, CK-CoOp achieves state-of-the-art or competitive performance while substantially reducing parameters and training/inference time compared with prior methods such as CoOp, CoCoOp, and ProGrad. The approach yields practical benefits for rapid, scalable adaptation of VLMs in real-world tasks, with ablations confirming the value of the compositional structure and Kronecker bias.

Abstract

Context Optimization (CoOp) has emerged as a simple yet effective technique for adapting CLIP-like vision-language models to downstream image recognition tasks. Nevertheless, learning a compact context that generalizes well across base-to-new, domain, and cross-task settings while adapting to new tasks remains a challenge. To tackle this challenge, we propose a lightweight yet generalizable approach termed Compositional Kronecker Context Optimization (CK-CoOp). Technically, the prompt's context words in CK-CoOp are learnable vectors, which are crafted by linearly combining base vectors sourced from a dictionary. These base vectors consist of a non-learnable component obtained by quantizing the weights in the token embedding layer, and a learnable component constructed by applying the Kronecker product to several tiny learnable matrices. Intuitively, the compositional structure mitigates the risk of overfitting on training data by retaining more pre-trained knowledge. Meanwhile, the Kronecker product relaxes the non-learnable restriction of the dictionary, thereby enhancing representation ability with minimal additional parameters. Extensive experiments confirm that CK-CoOp not only achieves state-of-the-art performance under base-to-new, domain, and cross-task generalization evaluation, but also requires fewer learnable parameters and offers efficient training and inference.
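The construction described in the abstract can be illustrated with a small NumPy sketch. This is a toy illustration under stated assumptions, not the authors' implementation: the sizes `K`, `d`, and `M`, the factor shapes, the random stand-in for the quantized token embeddings, and the additive combination of the frozen and learnable parts are all hypothetical choices made here for clarity. The helper names `kron_chain` and `build_context` are likewise invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions): dictionary size, embedding width, context length.
K, d, M = 16, 8, 4

# Frozen part: random stand-in for base vectors that the paper obtains by
# quantizing the token embedding weights. It is never updated.
frozen_base = rng.standard_normal((K, d))

# Learnable part: tiny factors whose Kronecker product has shape (K, d).
# Shapes (4, 2) and (4, 4) give (4*4, 2*4) = (16, 8).
factors = [
    0.01 * rng.standard_normal((4, 2)),
    0.01 * rng.standard_normal((4, 4)),
]

def kron_chain(mats):
    """Kronecker product of a list of matrices, taken left to right."""
    out = mats[0]
    for m in mats[1:]:
        out = np.kron(out, m)
    return out

def build_context(frozen_base, factors, coeffs):
    """Return M context vectors as linear combinations of the base vectors."""
    bias = kron_chain(factors)       # (K, d), built from very few parameters
    dictionary = frozen_base + bias  # assumption: the two parts are summed
    return coeffs @ dictionary       # (M, K) @ (K, d) -> (M, d)

# Learnable combination weights, one row per context token.
coeffs = rng.standard_normal((M, K))
context = build_context(frozen_base, factors, coeffs)

# Parameters in the Kronecker factors versus a dense (K, d) bias.
n_bias_params = sum(f.size for f in factors)
n_dense_params = K * d
print(context.shape, n_bias_params, n_dense_params)
```

With these toy shapes the two factors hold 24 parameters in total, whereas a dense bias of the same shape would hold 128, which is the kind of saving the Kronecker structure is meant to provide.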

Paper Structure (13 sections, 8 equations, 3 figures, 8 tables)

This paper contains 13 sections, 8 equations, 3 figures, 8 tables.

Introduction
Related Work
Method
  Preliminary
  CK-CoOp
Experiments
  Setup
  Base-to-New Generalization
  Domain Generalization
  Cross-Task Generalization
  Multi-dimensional Comparison
  Ablation Study
Conclusion

Figures (3)

Figure 1: Framework of the proposed CK-CoOp. Unlike CoOp, CK-CoOp imposes a compositional constraint on the context. The left part in the dashed box corresponds to the process of generating the constrained context proposed in this work. Removing this part results in the original CoOp.
Figure 2: (a) Effect of $\alpha$ on base-to-new, domain and cross-task generalization performance. (b) Sorted $L_2$ norm of columns of bias matrix in logarithmic scale. (c) Visualization of various matrices in CK-CoOp. In (a), 'DG:S', 'DG:T', 'DG:mean' denote domain generalization performance on source domain, domain generalization performance on target domain, and overall domain generalization performance. 'CT:S', 'CT:T', 'CT:mean' denote cross-task generalization performance on source task, cross-task generalization performance on target task, and overall cross-task generalization performance.
Figure 3: Statistics of max cosine similarities.
